Acoustic sequences in non-human animals: A tutorial review and prospectus


Article in Biological Reviews - November 2014 


DOI: 10.1111/brv.12160 


All content following this page was uploaded by Sara Waller on 10 May 2019. 




BIOLOGICAL REVIEWS
Cambridge Philosophical Society

Biol. Rev. (2014), pp. 000–000.
doi: 10.1111/brv.12160 


Acoustic sequences in non-human animals: a 
tutorial review and prospectus 


Arik Kershenbaum*, Daniel T. Blumstein, Marie A. Roch, Çağlar Akçay,
Gregory Backus, Mark A. Bee, Kirsten Bohn, Yan Cao, Gerald Carter,
Cristiane Casar, Michael Coen, Stacy L. DeRuiter, Laurance Doyle,
Shimon Edelman, Ramon Ferrer-i-Cancho, Todd M. Freeberg,
Ellen C. Garland, Morgan Gustison, Heidi E. Harley, Chloé Huetz,
Melissa Hughes, Julia Hyland Bruno, Amiyaal Ilany, Dezhe Z. Jin,
Michael Johnson, Chenghui Ju, Jeremy Karnowski, Bernard Lohr,
Marta B. Manser, Brenda McCowan, Eduardo Mercado III,
Peter M. Narins, Alex Piel, Megan Rice, Roberta Salmi,
Kazutoshi Sasahara, Laela Sayigh, Yu Shiu, Charles Taylor, Edgar E. Vallejo,
Sara Waller and Veronica Zamora-Gutierrez


1 National Institute for Mathematical and Biological Synthesis, University of Tennessee, 1122 Volunteer Blvd., Suite 106, Knoxville, TN 37996-3410, U.S.A.
2 Department of Zoology, University of Cambridge, Downing Street, Cambridge, CB2 3EJ, U.K.
3 Department of Ecology and Evolutionary Biology, University of California Los Angeles, 621 Charles E. Young Drive South, Los Angeles, CA 90095-1606, U.S.A.
4 Department of Computer Science, San Diego State University, 5500 Campanile Dr, San Diego, CA 92182, U.S.A.
5 Lab of Ornithology, Cornell University, 159 Sapsucker Woods Rd, Ithaca, NY 14850, U.S.A.
6 Department of Biomathematics, North Carolina State University, Raleigh, NC 27607, U.S.A.
7 Department of Ecology, Evolution and Behavior, University of Minnesota, 100 Ecology Building, 1987 Upper Buford Cir, Falcon Heights, MN 55108, U.S.A.
8 School of Integrated Science and Humanity, Florida International University, Modesto Maidique Campus, 11200 SW 8th Street, AHC-4, 351, Miami, FL 33199, U.S.A.
9 Department of Mathematical Sciences, University of Texas at Dallas, 800 W Campbell Rd, Richardson, TX 75080, U.S.A.
10 Biological Sciences Graduate Program, University of Maryland, College Park, MD 20742, U.S.A.
11 Department of Psychology & Neuroscience, University of St. Andrews, St Mary's Quad South Street, St Andrews, KY16 IJP. U.K.
12 Department of Biostatistics and Medical Informatics, K6/446 Clinical Sciences Center, University of Wisconsin, 600 Highland Avenue, Madison, WI 53792-4675, U.S.A.
13 Department of Biology School of Mathematics and Statistics, University of St. Andrews, St Andrews, KY16 98S, U.K.
14 Carl Sagan Center for the Study of Life in the Universe, SETI Institute, 189 Bernardo Ave, Suite 100, Mountain View, CA 94043, U.S.A.
15 Department of Psychology, Cornell University, 211 Uris Hall, Ithaca, NY 14853-7601, U.S.A.
16 Department of Computer Science, Universitat Politecnica de Catalunya (Catalonia), Calle Jordi Girona, 31, 08034 Barcelona, Spain
17 Department of Psychology, University of Tennessee, Austin Peay Building, Knoxville, TN 37996, U.S.A.
18 National Marine Mammal Laboratory, AFSC/NOAA, 7600 Sand Point Way N.E., Seattle, WA 98115, U.S.A.
19 Department of Psychology, University of Michigan, 530 Church St, Ann Arbor, MI 48109, U.S.A.
20 Division of Social Sciences, New College of Florida, 5800 Bay Shore Rd, Sarasota, FL 34243, U.S.A.
21 CNPS, CNRS UMR 8195, Université Paris-Sud, UMR 8195, Batiments 440-447, Rue Claude Bernard, 91405 Orsay, France
22 Department of Biology, College of Charleston, 66 George St, Charleston, SC 29424, U.S.A.


* Address for correspondence (Tel: +44-1223-3336682; E-mail: arik.kershenbaum@gmail.com). 


Biological Reviews (2014) 000—000 © 2014 Cambridge Philosophical Society 




23 Department of Psychology, Hunter College and the Graduate Center, The City University of New York, 365 Fifth Avenue, New York, NY 10016, U.S.A.
24 Department of Physics, Pennsylvania State University, 104 Davey Lab, University Park, PA 16802-6300, U.S.A.
25 Department of Electrical and Computer Engineering, Marquette University, 1515 W. Wisconsin Ave., Milwaukee, WI 53233, U.S.A.
26 Department of Biology, Queens College, The City University of New York, 65-30 Kissena Blvd., Flushing, NY 11367, U.S.A.
27 Department of Cognitive Science, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0515, U.S.A.
28 Department of Biological Sciences, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, U.S.A.
29 Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland
30 Department of Veterinary Medicine, University of California Davis, 1 Peter J Shields Ave, Davis, CA 95616, U.S.A.
31 Department of Psychology, University at Buffalo, The State University of New York, Park Hall Room 204, Buffalo, NY 14260-4110, U.S.A.
32 Department of Evolution, Ecology, & Behavior, University at Buffalo, The State University of New York, Park Hall Room 204, Buffalo, NY 14260-4110, U.S.A.
33 Department of Integrative Biology & Physiology, University of California Los Angeles, 612 Charles E. Young Drive East, Los Angeles, CA 90095-7246, U.S.A.
34 Division of Biological Anthropology, University of Cambridge, Pembroke Street, Cambridge, CB2 3QG, U.K.
35 Department of Psychology, California State University San Marcos, 333 S. Twin Oaks Valley Rd., San Marcos, CA 92096-0001, U.S.A.
36 Department of Anthropology, University of Georgia at Athens, 355 S Jackson St, Athens, GA 30602, U.S.A.
37 Department of Complex Systems Science, Graduate School of Information Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan
38 Biology Department, Woods Hole Oceanographic Institution, 266 Woods Hole Rd, Woods Hole, MA 02543-1050, U.S.A.
39 Department of Computer Science, Monterrey Institute of Technology, Ave. Eugenio Garza Sada 2501 Sur Col. Tecnológico C.P. 64849, Monterrey, Nuevo León, Mexico
40 Department of Philosophy, Montana State University, 2-155 Wilson Hall, Bozeman, MT 59717, U.S.A.
41 Centre for Biodiversity and Environment Research, University College London, Gower St, London, WC1E 6BT, U.K.


ABSTRACT 


acoustic units. Apart from the well-known example of birdsong, other animals such as insects, amphibians,
and mammals (including bats, rodents, primates, and cetaceans) also generate complex acoustic sequences. 
Occasionally, such as with birdsong, the adaptive role of these sequences seems clear (e.g. mate attraction 


and territorial defence). More often however, researchers have only begun to characterise — let alone 


Our review aims to outline suitable methods 
for testing these hypotheses, and to describe the major limitations to our current and near-future knowledge 
on questions of acoustic sequences. This review and prospectus is the result of a collaborative effort between 
43 scientists from the fields of animal behaviour, ecology and evolution, signal processing, machine learning, 
quantitative linguistics, and information theory, who gathered for a 2013 workshop entitled, ‘Analysing vocal 
sequences in animals’. Our goal is to present not just a review of the state of the art, but to propose a 
methodological framework that summarises what we suggest are the best practices for research in this field, 
across taxa and across disciplines. We also provide a tutorial-style introduction to some of the most promising 
algorithmic approaches for analysing sequences. We divide our review into three sections: identifying the 
distinct units of an acoustic sequence, describing the different ways that information can be contained within a 
sequence, and analysing the structure of that sequence. Each of these sections is further subdivided to address 
the key questions and approaches in that area. We propose a uniform, systematic, and comprehensive approach 
to studying sequences, with the goal of clarifying research terms used in different fields, and facilitating 


collaboration and comparative studies. Allowing greater interdisciplinary collaboration will facilitate the


Key words: acoustic communication, information, information theory, machine learning, Markov model, 
meaning, network analysis, sequence analysis, vocalisation. 
I. INTRODUCTION

to be ritualised signals where the signaller benefits if
the signal is detected and acted upon by a receiver. The
most studied examples include birdsong, where males
may use sequences to advertise their potential quality to
rival males and to receptive females (Catchpole & Slater,
2003). Acoustic sequences can contain information on
e.g. in marmots Marmota spp. (Blumstein, 2007),
primates (Schel, Tranquilli & Zuberbühler, 2009; Casar
et al., 2012b), and parids (Baker & Becker, 2002). In
many cases, however, the ultimate function of communi- 
cating in sequences is unclear. Understanding the prox- 
imate and ultimate forces driving and constraining the 
evolution of acoustic sequences, as well as decoding the 
information contained within them, is a growing field 
in animal behaviour (Freeberg, Dunbar & Ord, 2012). 
New analytical techniques are uncovering characteris- 
tics shared among diverse taxa, and offer the potential 
of describing and interpreting the information within 
animal communication signals. The field is ripe for 
a review and a prospectus to guide future empirical 
research. 

Progress in this field could benefit from an approach 
that can bridge and bring together inconsistent termi- 
nology, conflicting assumptions, and different research 
goals, both between disciplines (e.g. between biologists 
and mathematicians), and also between researchers 
concentrating on different taxa (e.g. ornithologists and 
primatologists). Therefore, we aim to do more than 
provide a glossary of terms. Rather, we build a frame- 
work that identifies the key conceptual issues com- 
mon to the study of acoustic sequences of all types, 
while providing specific definitions useful for clarifying 
questions and approaches in more narrow fields. Our 
approach identifies three central questions: what are 
the units that compose the sequence? How is informa- 
tion contained within the sequence? How do we assess 
the structure governing the composition of these units? 
Figure 1 illustrates a conceptual flow diagram linking 
these questions, and their sub-components, and should 
be broadly applicable to any study involving animal 
acoustic sequences. 

Our aims in this review are as follows: (i) to identify the
key issues and concepts necessary for the successful analysis
of animal acoustic sequences; (ii) to describe the
commonly used analytical techniques and, importantly,
also those underused methods deserving of more attention;
(iii) to encourage a cross-disciplinary approach to
the study of animal acoustic sequences that takes advantage
of tools and examples from other fields to create
a broader synthesis; and (iv) to facilitate the investigation
of new questions through the articulation of a solid
conceptual framework.

In Section II we ask why sequences are important, and
what is meant by ‘information’ content and ‘meaning’
in sequences. In Section III, we examine the questions
of what units make up a sequence and how to identify
them. In some applications the choice seems trivial;
however, in many study species, sequences can be
represented at different hierarchical levels of abstraction,
and the choice of sequence ‘unit’ may depend on
the hypotheses being tested. In Section IV, we look at
the different ways that units can encode information in
sequences. In Section V, we examine the structure of the
sequence, the mathematical and statistical models that
quantify how units are combined, and how these models
can be analysed, compared, and assessed. In Section
VI, we describe some of the evolutionary and ecological
questions that can be addressed by analysing animal
acoustic sequences, and look at some promising future
directions and new approaches.


II. THE CONCEPTS OF INFORMATION AND 
MEANING 


The complementary terms, ‘meaning’ and ‘informa- 
tion’ in communication, have been variously defined, 
and have long been the subject of some controversy 
(Dawkins & Krebs, 1978; Stegmann, 2013). In this 
section we explore some of the different definitions 
from different fields, and their significance for research 
on animal behaviour. The distinction between informa- 
tion and meaning is sometimes portrayed with infor- 
mation as the form or structure of some entity on 
the one hand, and meaning as the resulting activity 
of a receiver of that information on the other hand 
(Bohm, 1989). 


(1) Philosophy of meaning 


The different vocal signals of a species are typically 
thought to vary in ways associated with factors that 
are primarily internal (hormonal, motivational, emo- 
tional), behavioural (movement, affiliation, agonistic), 
external (location, resource and threat detection), or 
combinations of such factors. Much of the variation 
in vocal signal structure and signal use relates to what 
W. John Smith called the message of the signal - the 
‘kinds of information that displays enable their users 
to share’ (Smith, 1977, p. 70). Messages of signals are 
typically only understandable to us as researchers after 
considerable observational effort aimed at determining 
the extent of association between signal structure and 
use, and the factors mentioned above. The receiver of 
a signal gains information, or meaning, from the struc- 
ture and use of the signal. Depending on whether the 
interests of the receiver and the signaller are aligned 
or opposed, the receiver may benefit, or potentially 
be fooled or deceived, respectively (Searcy & Nowicki, 
2005). The meaning of a signal stems not just from the 
message or information in the signal itself, but also from 
the context in which the signal is produced. The context 
of communication involving a particular signal could 
relate to a number of features, including signaller char- 
acteristics, such as recent signals or cues it has sent, as 
well as location or physiological state, and receiver char- 
acteristics, such as current behavioural activity or recent 
experience. Context can also relate to joint signaller and
receiver characteristics, such as the nature of their
relationship (Smith, 1977).
Fig. 1. Flowchart showing a typical analysis of animal acoustic sequences. In this review, we discuss identifying units,
characterising sequences, and identifying meaning.


Philosophical understanding of meaning is rooted in 
studies of human language and offers a variety of schools 


of thought. As an example, we present a list of some








(2) Context 


well. Context includes internal and external factors that 
may influence both the production and perception of 
acoustic sequences; the effects of context can partially 
be understood by considering how it specifically influ- 
ences the costs and benefits of producing a particular 
signal or responding to it. For instance, an individual's
motivational, behavioural, or physiological state may
influence response (Lynch et al., 2005; Goldbogen
et al., 2013); hungry animals respond differently to
signals than satiated ones, and an individual in oestrus
or musth may respond differently than one not in
that altered physiological state (Poole, 1999). Sex
may influence response as well (Tyack, 1983; Darling,
Jones & Nicklin, 2006; Smith et al., 2008; van Schaik,
Damerius & Isler, 2013). The social environment may
influence the costs and benefits of responding to a
particular signal (Bergman et al., 2003; Wheeler, 2010a;
Ilany et al., 2011; Wheeler & Hammerschmidt, 2012),
as might environmental attributes, such as temperature
or precipitation. Knowledge from other social
interactions or environmental experiences can also 
play a role in context, e.g. habituation (Krebs, 1976). 
Context can also alter the behavioural response to the
same signal when it is heard from different spatial
locations. For instance, in neighbour-stranger discrimination
in songbirds, territorial males typically respond
less aggressively toward neighbours compared with 
strangers, so long as the two signals are heard coming 
from the direction of the neighbour's territory. If both 
signals are played back from the centre of the subject’s 
territory, or from a neutral location, subjects typically 
respond equally aggressively to both neighbours and 
strangers (Falls, 1982; Stoddard, 1996). Identifying and 
testing for important contextual factors appears to be 
an essential step in decoding the meaning of sequences. 
In human language, context has been proposed to 
be either irrelevant to, or crucial to, the meaning of 
words and sentences. In some cases, a sentence bears 
the same meaning across cultures, times, and locations, 
irrespective of context, e.g. ‘2 + 2 = 4’ (Quine, 1960). In
other cases, meaning is derived at least partially from 
external factors, e.g. the chemical composition of a sub- 
stance defines its nature, irrespective of how the sub- 
stance might be variously conceived by different people 
(Putnam, 1975). By contrast, indexical terms such as 
‘she’ gain meaning only as a function of context, such
as physical or implied pointing gestures (Kaplan, 1978). 
Often, the effect of the signal on the receivers deter- 
mines its usefulness, and that usefulness is dependent 
upon situational-contextual forces (Millikan, 2004). 


(3) Contrasting definitions of meaning 


Biologists (particularly behavioural ecologists) and
cognitive neuroscientists have different understandings
of meaning. For most biologists, meaning relates to the 
function of signalling. The function of signals is exam- 
ined in agonistic and affiliative interactions, in courtship 
and mating decisions, and in communicating about 
environmental stimuli, such as the detection of preda- 
tors (Bradbury & Vehrencamp, 2011). Behavioural 
ecologists study meaning by determining the degree
of production specificity, the degree of response
specificity, and contextual independence (e.g. Evans,
1997). Cognitive neuroscientists generally understand
meaning through mapping behaviour onto
structure—function relationships in the brain (Chatter- 
jee, 2005). 
Mathematicians understand meaning by developing
theories and models to interpret the observed signals. 
This includes defining and quantifying the variables 
(observable and unobservable), and the formalism for 
combining various variables into a coherent framework, 
e.g. pattern theory (Mumford & Desolneux, 2010). One 
approach to examining a signal mathematically is to 
determine the entropy, or amount of structure (or 
lack thereof) present in a sequence. An entropy metric 
places a bound on the maximum amount of information 
that can be present in a signal, although it does not 
determine that such information is, in fact, present. 
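
As a minimal sketch of the entropy calculation mentioned above, the following Python snippet computes the zero-order Shannon entropy of a sequence of unit labels, together with the largest entropy the same repertoire could reach if all unit types were equally likely. The unit labels and the example `song` are hypothetical, and this simple estimator ignores both the order of units and small-sample bias, so it illustrates the idea rather than prescribing a procedure.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Zero-order Shannon entropy (bits per unit) of a sequence of unit labels."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def max_entropy(sequence):
    """Upper bound: entropy if every observed unit type were equally likely."""
    return math.log2(len(set(sequence)))


# Hypothetical sequence of labelled units (illustrative only).
song = list("ABABABCABABABC")

h = shannon_entropy(song)
h_max = max_entropy(song)
print(f"entropy = {h:.3f} bits/unit, maximum for this repertoire = {h_max:.3f} bits/unit")
```

The gap between the two values reflects how unevenly the unit types are used; as the text notes, neither number shows that the sequence actually carries that much information.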

Qualitatively, we infer meaning in a sequence if it 
modifies the receiver's response in some predictable 
way. Quantitatively, information theory measures the 
amount of information (usually in units of bits) trans- 
mitted and received within a communication system 
(Shannon et al., 1949). Therefore, information theory 
approaches can describe the complexity of the commu- 
nication system. Information theory additionally can 
characterise transmission errors and reception errors, 
and has been comprehensively reviewed in the context 
of animal communication in Bradbury & Vehrencamp 
(2011). 
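
One hedged way to make the idea of transmitted and received information concrete is the mutual information between the unit that was sent and the unit that the receiver registered. The sketch below estimates it in bits from paired observations. The paired data are invented for illustration, the function name `mutual_information` is our own, and probabilities are estimated from raw counts, which is known to be biased for small samples.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information (bits) between sent and received labels.

    `pairs` is a list of (sent, received) tuples; probabilities are
    estimated directly from counts (an assumption of this sketch).
    """
    n = len(pairs)
    joint = Counter(pairs)
    sent = Counter(s for s, _ in pairs)
    received = Counter(r for _, r in pairs)
    # p(s, r) * log2( p(s, r) / (p(s) * p(r)) ), written with counts.
    return sum(
        (c / n) * math.log2(c * n / (sent[s] * received[r]))
        for (s, r), c in joint.items()
    )


# Hypothetical data: two call types with occasional reception errors.
noisy = [("A", "A")] * 9 + [("A", "B")] * 1 + [("B", "B")] * 8 + [("B", "A")] * 2
# Hypothetical error-free channel for comparison.
perfect = [("A", "A")] * 10 + [("B", "B")] * 10

mi_noisy = mutual_information(noisy)
mi_perfect = mutual_information(perfect)
print(f"noisy channel: {mi_noisy:.3f} bits; error-free channel: {mi_perfect:.3f} bits")
```

With two equally frequent call types the error-free channel transmits one bit per call, and reception errors reduce the transmitted information below that value.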

The structure of acoustic signals does not necessarily 
have meaning per se, and so measuring that structure
does not necessarily reveal the complexity of mean- 
ing. As one example, the structure of an acoustic signal 
could be related to effective signal transmission through 
a noisy or reverberant environment. A distinction is 
often made between a signal's ‘content’, or broadcast 
information, and its ‘efficacy’, or transmitted informa- 
tion — the characteristics or features of signals that actu- 
ally reach receivers (Wiley, 1983; Hebets & Papaj, 2005). 
This is basically the distinction between bearing func- 
tional information and getting that information across 
to receivers in conditions that can be adverse to clear sig- 
nal propagation. A sequence may also contain elements 
that do not in themselves contain meaning, but are 
intended to get the listeners' attention, in anticipation 
of future meaningful elements (e.g. Richards, 1981; 
Call & Tomasello, 2007; Arnold & Zuberbühler, 2013). 

Considerable debate exists over the nature of ani- 
mal communication and the terminology used in ani- 
mal communication research (Owren, Rendall & Ryan, 
2010; Seyfarth et al., 2010; Ruxton & Schaefer, 2011; 
Stegmann, 2013), and in particular the origin of and 
relationship between meaning and information, and 
their evolutionary significance. For our purposes, we 
will use the term ‘meaning’ when discussing behavioural 
and evolutionary processes, and the term ‘information’
when discussing the mathematical and statistical properties
of sequences. This parallels (but is distinct from) the
definitions given by Ruxton & Schaefer (2011), in particular
because we wish to have a single term (‘information’)
that describes inherent properties of sequences,
without reference to the putative behavioural effects on 


Biological Reviews (2014) 000—000 © 2014 Cambridge Philosophical Society 


Acoustic sequences in animals 


receivers, or the ultimate evolutionary processes that 
caused the sequence to take the form that it does. 

We have so far been somewhat cavalier in how we have 
described the structures of call sequences, using terms 
like notes, units, and, indeed, calls. In the next section of 
our review, we describe in depth the notion of signalling 
‘units’ in the acoustic modality.


III. ACOUSTIC UNITS 


Indeed, definitions of units, how they are identified, and 
the semantic labels we assign them vary widely across 
researchers working with different taxonomic groups 
(Gerhardt & Huber, 2002) or even within taxonomic 
groups, as illustrated by the enormous number of names 
for different units in the songs of songbird species. Our 
purpose in this section is to discuss issues surround- 
ing the various ways the acoustic units composing a 
sequence may be characterised. 

Units may be identified based on either production 
mechanisms, which focus on how the sounds are gener- 
ated by signallers, or by perceptual mechanisms, which 
focus on how the sounds are interpreted by receivers. 
How we define a unit will therefore be different if the 
biological question pertains to production mechanisms 
or perceptual mechanisms. For example, in birdsong 
even a fairly simple note may be the result of two physi- 
cal production pathways, each made on a different side 
of the syrinx (Catchpole & Slater, 2003). In practice, 
however, the details of acoustic production and percep- 
tion are often hidden from the researcher, and so the 
definition of acoustic units is often carried out on the 
basis of observed acoustic properties: see Catchpole & 
Slater (2003). It is not always clear to what extent these 
observed acoustic properties accurately represent the 
production/perceptual constraints on communication, 
and the communicative role of the sequence. Identi- 
fying units is made all the more challenging because 
acoustic units produced by animals often exhibit graded 
variation in their features (e.g. absolute frequency, dura- 
tion, rhythm or tempo, or frequency modulation), but 
most analytical methods for unit classification assume 
that units can be divided into discrete, distinct cate- 
gories (e.g. Clark, Marler & Beeman, 1987). 

How we identify units may differ depending on 
whether the biological question pertains to produc- 
tion mechanisms, perceptual mechanisms, or acousti- 
cal analyses of information content in the sequences. 
If the unit classification scheme must reflect animal 
sound production or perception, care must be taken 
to base unit identification on the appropriate features 


of a signal, and features that are biologically rele- 
vant, e.g. Clemins & Johnson (2006). In cases where 
sequences carry meaning, it is likely that they can
be correlated with behaviours (possibly
context-dependent) observed over a large number of
trials.


To some 
degree, this can be tested with playback trials where the 
signals are manipulated with respect to the hypothe- 
sised unit sequence (Kroodsma, 1989; Fischer, Noser & 
Hammerschmidt, 2013). 


For example, a pure tone may become harmonic or 
noisy, as the result of the animal altering its articula- 
tors (e.g. lips), without ceasing sound production in the 
source (e.g. larynx). 


This is characteristic of pulse 
trains and ‘trills’. 





In Table 1, we give examples of the wide range 
of studies that have used these different criteria for 
dividing acoustic sequences into units. Although not 
intended to be comprehensive, the table shows how 
all of the four criteria listed above have been used 
for multiple species and with multiple aims — whether 
simply characterising the vocalisations, defining units 
of production/perception, or identifying the functional 
purpose of the sequences. 


(1) Identifying potential units 


Before we discuss in more detail how acoustic units 
may be identified in terms of production, perception, 
and analysis methods, we point out here that practically 
all such efforts require scientists to identify potential 
units at some early stage of their planned investi- 
gation or analysis. Two practical considerations are 
noteworthy. 
hierarchical in nature, e.g. humpback whale Megaptera
novaeangliae song, reviewed in Cholewiak, Sousa-Lima &
Cerchio (2012), distinct sequences of units may themselves
be organised into longer, distinctive sequences,
i.e. ‘sequences of sequences’ (Berwick et al., 2011).
a potential unit can be performed either manually
(i.e. examining the spectrograms ‘by eye’), or automatically
by using algorithms for either supervised classification
(where sounds are placed in categories according
to pre-defined exemplars) or unsupervised clustering
(where labelling units is performed without prior knowledge
of the types of units that occur). We return to these
analytical methods in Section III.4, and elaborate here
on spectrographic representations.
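
The contrast between supervised classification and unsupervised clustering can be sketched as follows. This is an illustrative toy example, not a method taken from the review: the two features (duration in seconds, peak frequency in kHz), the exemplar names, and the measurements are all hypothetical, and plain k-means stands in for whatever clustering algorithm a study might actually use.

```python
import numpy as np


def classify_by_exemplar(features, exemplars):
    """Supervised: give each sound the label of its nearest pre-defined exemplar."""
    names = list(exemplars)
    centres = np.array([exemplars[name] for name in names], dtype=float)
    dist = np.linalg.norm(features[:, None, :] - centres[None, :, :], axis=2)
    return [names[i] for i in dist.argmin(axis=1)]


def kmeans(features, k, n_iter=20, seed=0):
    """Unsupervised: plain k-means; the cluster ids have no predefined meaning."""
    rng = np.random.default_rng(seed)
    centres = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(features[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave a centre in place if its cluster is empty
                centres[j] = features[labels == j].mean(axis=0)
    return labels


# Hypothetical measurements: [duration (s), peak frequency (kHz)] for six sounds.
features = np.array([
    [0.10, 2.0], [0.12, 2.1], [0.11, 1.9],
    [0.50, 6.0], [0.52, 6.2], [0.48, 5.9],
])
# Hypothetical exemplars chosen in advance by the researcher.
exemplars = {"low": [0.10, 2.0], "high": [0.50, 6.0]}

supervised_labels = classify_by_exemplar(features, exemplars)
cluster_ids = kmeans(features, k=2)
print(supervised_labels)
print(cluster_ids)
```

The supervised labels are meaningful only relative to the exemplars chosen beforehand, whereas the cluster ids are arbitrary integers that still need to be interpreted; in practice the features would normally be scaled before distances are computed.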

Fig. 3. Example of cepstral processing of a grey wolf Canis lupus howl (below 6 kHz) and crickets chirping (above 6.5 kHz).
Recording was sampled at Fs = 16 kHz, 8 bit quantization. (A) Standard spectrogram analysed with a 15 ms Blackman-Harris
window. (B) Plot of transform to cepstral domain. Lower quefrencies are related to vocal tract information. F0 can be
determined from the ‘cepstral bump’ apparent between quefrencies 25–45 and can be derived by Fs/quefrency. (C)
Cepstrum (inset) of the frame indicated by an arrow in (A) (2.5 s) along with reconstructions of the spectrum created
from truncated cepstral sequences. Fidelity improves as the number of cepstra is increased.

Spectrograms (consisting of discrete Fourier transforms
of short, frequently overlapped, segments of the
signal) are ubiquitous and characterise well those acoustic
features related to spectral profile and frequency
modulation, many of which are relevant in animal acoustic
communication. Examples of such features include
minimum and maximum fundamental frequency, slope
of the fundamental frequency, number of inflection
points, and the presence of harmonics (Oswald et al.,
2007) that vary, for example, between individuals (Buck
& Tyack, 1993; Blumstein & Munos, 2005; Koren &
Geffen, 2011; Ji et al., 2013; Kershenbaum, Sayigh &
Janik, 2013; Root-Gutteridge et al., 2014), and in different
environmental and behavioural contexts (Matthews
et al., 1999; Taylor, Reby & McComb, 2008; Henderson,
Hildebrand & Smith, 2011).
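
The spectrogram computation described above (discrete Fourier transforms of short, overlapping segments) can be sketched in a few lines of NumPy. The signal here is a synthetic upsweep rather than a real recording, and the window length, hop size, and Hann window are arbitrary choices made for illustration. The last lines extract a crude peak-frequency contour and three of the features listed in the text (minimum, maximum, and slope), assuming the peak tracks the fundamental, which holds only for a clean tonal signal like this one.

```python
import numpy as np


def spectrogram(signal, fs, win_len=256, hop=128):
    """Magnitude spectrogram from short, overlapping, windowed DFT frames."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + win_len] * window for i in range(n_frames)
    ])
    mag = np.abs(np.fft.rfft(frames, axis=1))          # shape: (frames, frequency bins)
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)       # bin centre frequencies in Hz
    times = (np.arange(n_frames) * hop + win_len / 2) / fs  # frame centres in seconds
    return times, freqs, mag


# Synthetic 1 s upsweep from 1 kHz to 3 kHz (a stand-in for a tonal call).
fs = 16000
t = np.arange(fs) / fs
f_inst = 1000 + 2000 * t
x = np.sin(2 * np.pi * np.cumsum(f_inst) / fs)

times, freqs, mag = spectrogram(x, fs)
peak = freqs[mag.argmax(axis=1)]                       # peak-frequency contour
f_min, f_max = peak.min(), peak.max()
slope = np.polyfit(times, peak, 1)[0]                  # Hz per second
print(f"min {f_min:.0f} Hz, max {f_max:.0f} Hz, slope {slope:.0f} Hz/s")
```

With a 256-sample window at 16 kHz the frequency resolution is 62.5 Hz and each frame spans 16 ms; a longer window sharpens frequency estimates at the cost of temporal detail.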





Discarding coefficients can yield a compact representation
of the spectrum (Fig. 3). Further, while Fourier transforms
have uniform temporal and frequency resolution,
other techniques vary this resolution by using different
basis sets, and this provides improved frequency resolution
at low frequencies and better temporal resolution at
higher frequencies. Examples of these other techniques
include multi-taper spectra (Thomson, 1982; Tchernichovski
et al., 2000; Baker & Logue, 2003), Wigner-Ville
spectra (Martin & Flandrin, 1985; Cohn, 1995), and
wavelet analysis (Mallat, 1999). While spectrograms and
cepstra are useful for examining frequency-related features
of signals, they are less useful when analysing
temporal patterns of amplitude modulation. This is an
important issue worth bearing in mind, because amplitude
modulations are probably critical in signal perception
by many animals (Henry et al., 2011), including
speech perception by humans (Remez et al., 1994).
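
As a rough illustration of the cepstral processing referred to in Fig. 3, the sketch below computes the real cepstrum of a single frame, estimates the fundamental frequency as the sampling rate divided by the quefrency of the cepstral peak, and reconstructs a smoothed log spectrum from a truncated set of cepstral coefficients. The frame is a synthetic harmonic tone with a 400 Hz fundamental, not the wolf recording; the function names, the Hann window, and the quefrency search range of 25 to 45 samples (borrowed from the figure caption) are assumptions of this example.

```python
import numpy as np


def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log magnitude spectrum of a windowed frame."""
    spectrum = np.abs(np.fft.fft(frame * np.hanning(len(frame))))
    return np.fft.ifft(np.log(spectrum + 1e-12)).real


def f0_from_cepstrum(cep, fs, q_min, q_max):
    """Fundamental frequency = fs / quefrency of the largest cepstral value in range."""
    q = q_min + int(np.argmax(cep[q_min:q_max + 1]))
    return fs / q


def smoothed_log_spectrum(cep, n_keep):
    """Log spectrum rebuilt from the first n_keep cepstral coefficients only."""
    lifter = np.zeros(len(cep))
    lifter[:n_keep] = 1.0
    if n_keep > 1:
        lifter[-(n_keep - 1):] = 1.0   # keep the mirrored coefficients as well
    return np.fft.fft(cep * lifter).real


# Synthetic harmonic frame: 400 Hz fundamental, 19 harmonics, amplitudes 1/k.
fs, n = 16000, 1024
t = np.arange(n) / fs
frame = sum(np.sin(2 * np.pi * 400 * k * t) / k for k in range(1, 20))

cep = real_cepstrum(frame)
f0 = f0_from_cepstrum(cep, fs, q_min=25, q_max=45)
coarse = smoothed_log_spectrum(cep, n_keep=10)   # spectral envelope only
finer = smoothed_log_spectrum(cep, n_keep=60)    # envelope plus harmonic detail
print(f"estimated F0 is about {f0:.0f} Hz")
```

Keeping more coefficients brings the reconstruction closer to the original log spectrum, while keeping only the lowest quefrencies retains the broad spectral envelope in a compact form.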


(2) Identifying production units 


One important approach to identifying acoustic units 
stems from considering the mechanisms for sound pro- 
duction. In stridulating insects, for example, relatively 
simple, repeated sounds are typically generated by mus- 
culature action that causes hard physical structures to 
be engaged, such as the file and scraper located on 
the wings of crickets or the tymbal organs of cicadas
(Gerhardt & Huber, 2002). The resulting units, variously
termed ‘chirps’ or ‘pulses’, can be organised
into longer temporal sequences often termed ‘trills’
or ‘echemes’ (Ragge & Reynolds, 1988). Frogs can
produce sounds with temporally structured units in a
variety of ways (Martin, 1972; Martin & Gans, 1972;
Gerhardt & Huber, 2002). In some species, a single acoustic
unit (sometimes called a ‘pulse’, ‘note’, or ‘call’)
is produced by a single contraction of the trunk and 
laryngeal musculature that induces vibrations in the 
vocal folds (e.g. Girgenrath & Marsh, 1997). In other 
instances, frogs can generate short sequences of dis- 
tinct sound units (also often called ‘pulses’) produced 
by the passive expulsion of air forced through the lar- 
ynx that induces vibrations in structures called arytenoid 
cartilages, which impose temporal structure on sound 
(Martin, 1972; Martin & Gans, 1972). Many frogs organ- 
ise these units into trills (e.g. Gerhardt, 2001), while 
other species combine acoustically distinct units (e.g.
Narins, Lewis & McClelland, 2000; Larson, 2004). In 
songbirds, coordinated control of the two sides of the 
syrinx can be used to produce different units of sound, 
or ‘notes’ (Suthers, 2004). These units can be organised
into longer sequences of ‘notes’, ‘trills’, ‘syllables’,
‘phrases’, ‘motifs’, and ‘songs’ (Catchpole & Slater,
2003). In most mammals, sounds are produced as an air 
source (pressure squeezed from the lungs) causes vibra- 
tions in the vocal membranes, which are then filtered 
by a vocal tract (Titze, 1994). When resonances occur 
in the vocal tract, certain frequencies known as for- 
mants are reinforced. Formants and formant transitions 
have been strongly implicated in human perception of 
vowels and voiced consonants, and may also be used 
by other species to perceive information (Peterson & 
Barney, 1952; Raemaekers, Raemaekers & Haimoff, 
1984; Fitch, 2000). 

As the variety in these examples illustrates, there is 
incredible diversity in the mechanisms animals use to 
produce the acoustic units that are subsequently organ- 
ised into sequences. Moreover, there are additional 
mechanisms that constrain the production of some of 
the units. For example, in zebra finches Taeniopygia
guttata, songs can be interrupted between some of their
constitutive units but not others (Cynx, 1990). This 
suggests that at a neuronal level, certain units share 
a common, integrated neural production mechanism. 
Such examples indicate that identifying units based 
on metrics of audition or visual inspection of spectro- 
grams (e.g. based on silent gaps) may not always be 
justified, and that there may be essential utility that 
emerges from a fundamental understanding of unit pro- 
duction. Thus, a key consideration in identifying func- 
tional units of production is that doing so may often 
require knowledge about production mechanisms that 
can only come about through rigorous experimental 
studies. 

(3) Identifying perceptual units


While there may be fundamental insights gained from 
identifying units based on a detailed understanding of 
sound production, there may not always be a one-to-one 
mapping of the units of production, or the units identified
in acoustic analyses, onto units of perception (e.g.



to consider vocalisations and other sounds as auditory 
objects (Miller & Cohen, 2010). While the rules gov- 
erning auditory object formation in humans have been 
well studied (Griffiths & Warren, 2004; Bizley & Cohen, 
2013), the question of precisely how, and to what extent, 
non-humans group acoustic information into coher- 
ent perceptual representations remains a largely open 
empirical question (Hulse, 2002; Bee & Micheyl, 2008; 
Miller & Bee, 2012). 


Thus, in instances where there 
are few discrete differences in production mechanisms 
or in spectrograms, receivers might still perceive dis- 
tinct units (Nelson & Marler, 1989; Baugh, Akre & Ryan, 
2008). 


The perceived pitch is related to
the repetition rate: the faster the repetition, the higher
the pitch. Given the perceptual limits of gap detection
(Recanzone & Sutter, 2008), some silent gaps between 
units of production may be too short to be perceived by 
the receiver. Clearly, while it may sometimes be desirable 
or convenient to use ‘silence’ as a way to create analysis 

Fig. 4. Perceptual constraints for the definition of sequence units. (A) Perceptual binding, where two discrete acoustic
elements may be perceived by the receiver either as a single element, or as two separate ones. (B) Categorical
perception, where continuous variation in acoustic signals may be interpreted by the receiver as discrete categories. (C)
Spectrotemporal constraints, where if the receiver cannot distinguish small differences in time or frequency, discrete
elements may be interpreted as joined.

boundaries between units, a receiver may not always 
perceive the silent gaps that we see in our spectrograms. 
Likewise, some transitions in frequency may reflect 
units of production that are not perceived because the 
changes remain unresolved by auditory filters (Moore & 
Moore, 2003; Recanzone & Sutter, 2008). Indeed, some 
species may be forced to trade off temporal and spectral 
resolution to optimise signalling efficiency in different 
environmental conditions. Frequency modulated sig- 
nals are more reliable than amplitude modulation in 
reverberant habitats, such as forests, so woodland birds 
are adapted to greater frequency resolution and poorer 
temporal resolution, while the reverse is true of grass- 
land species (Henry & Lucas, 2010; Henry et al., 2011). 


animal itself. There simply is no convenient shortcut to 


identifying perceptual units. Experimental approaches
range from operant conditioning (e.g. Dooling et al.,
1987; Brown, Dooling & O'Grady, 1988; Dent et al.,
1997; Tu, Smith & Dooling, 2011; Ohms et al., 2012; Tu 
& Dooling, 2012), to field playback experiments, often 
involving the habituation-discrimination paradigm (e.g. 
Nelson & Marler, 1989; Wyttenbach, May & Hoy, 1996; 
Evans, 1997; Searcy, Nowicki & Peters, 1999; Ghazan- 
far et al., 2001; Weiss & Hauser, 2002). Such approaches 
have the potential to identify the boundaries of per- 
ceptual units. Playbacks additionally can determine 
whether units can be discriminated (as in 'go/no-go'
tasks stemming from operant conditioning), or whether 
they can be recognised and are functionally meaningful 
to receivers. 

Obviously some animals and systems are more 
tractable than others when it comes to assessing units 
of perception experimentally, but those not easy to 
manipulate experimentally (e.g. baleen whales,
Balaenopteridae) should not necessarily be excluded
from communication sequence research, although the 
inevitable constraints must be recognised. 


(4) Identifying analytical units 





We briefly discuss methods by which scientists can iden- 
tify and validate units for sequence analyses from acous- 
tic recordings. 

Sounds are typically assigned to units
based on the consistency of their acoustic characteristics.
When feasible, external validation of categories (i.e. 
comparing animal behavioural responses to playback 
experiments) should be performed. Even without 
directly testing hypotheses of biological significance by 
playback experiment, there may be other indicators of 
the validity of a classification scheme based purely on 
acoustic similarity. For example, naive human observers 
correctly divide dolphin signature whistles into groups 
corresponding closely to the individuals that produced 
them (Sayigh et al., 2007), and similar (but poorer)
results are achieved using quantitative measures of 
spectrogram features (Kershenbaum et al., 2013). 


Per- 
ceptual bias occurs either when the characteristics of 
the sound that are used to make the unit assignment 
are inappropriate for the communication system being 
studied, or when the classification scheme relies too 
heavily on those acoustic features that appear important 
to human observers. For example, analysing spectro- 
grams with a 50 Hz spectral resolution would be appro- 
priate for human speech, but not for Asian elephants 
Elephas maximus, which produce infrasonic calls that are 
typically between 14 and 24 Hz (Payne, Langbauer &
Thomas, 1986), as details of the elephant calls would 
be unobservable. Features that appear important to 
human observers may include tonal modulation shapes, 
often posed in terms of geometric descriptors, such 
as ‘upsweep’, ‘concave’, and ‘sine’ (e.g. Bazua-Duran 
& Au, 2002), which are prominent to the human eye, 
but may or may not be of biological relevance. Poor 
repeatability, or variance, can occur both in human clas- 
sification, as inter-observer variability, and in machine 
learning, where computer classification algorithms can 
make markedly different decisions after training with 
different sets of data that are very similar (overtraining). 
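
The elephant example can be made concrete with a little arithmetic. The sketch below assumes a spectrogram computed with a discrete Fourier transform, for which the spacing between frequency bins equals the sampling rate divided by the window length. The function names, the 8000 Hz sampling rate and the window lengths are illustrative assumptions, not values taken from any of the cited studies.

```python
def bin_spacing_hz(sample_rate_hz, window_length_samples):
    """Spacing between adjacent DFT frequency bins, in Hz."""
    return sample_rate_hz / window_length_samples


def bins_within_band(low_hz, high_hz, spacing_hz):
    """Count DFT bin centres (multiples of spacing_hz) inside [low_hz, high_hz]."""
    first = -(-low_hz // spacing_hz)  # ceiling division
    last = high_hz // spacing_hz      # floor division
    return max(0, int(last - first + 1))


# Hypothetical settings: 8000 Hz sampling with a short or a long window.
coarse = bin_spacing_hz(8000, 160)   # 50 Hz spacing, as in the text
fine = bin_spacing_hz(8000, 8000)    # 1 Hz spacing

# Number of bins available to describe a 14-24 Hz infrasonic call.
coarse_bins = bins_within_band(14, 24, coarse)
fine_bins = bins_within_band(14, 24, fine)
```

With the coarse setting no bin centre falls inside the 14 to 24 Hz band, so the structure of the call cannot be resolved. The longer window places several bins in the band, at the cost of poorer temporal resolution.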




(a) Visual classification, ‘by eye’ 


Traditionally, units are ‘hand-scored’ by humans search- 
ing for consistent patterns in spectrograms (or even 
listening to sound recordings without the aid of a 
spectrogram). Visual classification has been an effective 
technique that has led to many important advances in 
the study both of birdsong (e.g. Kroodsma, 1985; Podos 
et al., 1992; reviewed in Catchpole & Slater, 2003), and 
acoustic sequences in other taxa (e.g. Narins et al.,
2000; Larson, 2004). 





However, drawbacks to visual classification exist (Clark 
et al., 1987). Visual classification is time consuming and 
prevents taking full advantage of large acoustic data sets 
generated by automated recorders. Similarly, the diffi- 
culty in scoring large data sets means that sample sizes 
used in research may be too small to draw firm conclu- 
sions (Kershenbaum, 2013). Furthermore, visual classifi- 
cation can be prone to subjective errors (Jones, ten Cate 
& Bijleveld, 2001), and inter-observer reliability should 
be used (and reported) as a measure of the robustness 
of the visual assessments (Burghardt et al., 2012). 
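
One way to report inter-observer reliability is a chance-corrected agreement statistic. The sketch below uses Cohen's kappa for two observers who have each assigned a unit label to the same set of sounds. The choice of kappa is our assumption for illustration (the text does not prescribe a particular statistic), and the labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two observers' unit labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both observers must label the same non-empty set of sounds")
    n = len(labels_a)
    # Proportion of sounds given the same label by both observers.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if the observers labelled independently
    # with their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both observers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores from two observers for six sounds.
observer_1 = ["A", "A", "B", "B", "C", "C"]
observer_2 = ["A", "A", "B", "C", "C", "C"]
kappa = cohens_kappa(observer_1, observer_2)
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance. High agreement shows that the scoring is repeatable, not that the categories are biologically meaningful.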


(b) Classification of manually extracted metrics 





A vari- 
ety of time (e.g. duration, pulse repetition rate) and 
frequency (e.g. minimum, maximum, start, end, and 
range) components can be measured (extracted) from 
spectrograms, using varying degrees of automation, or 
computer assistance for a manual operator. Software 
tools such as Sound Analysis Pro (Tchernichovski et al., 
2000), Raven (Charif, Ponirakis & Krein, 2006), and Avi- 
soft (Specht, 2004) have been developed to assist with 
this task. Metrics are then used in classification anal- 
yses to identify units, using mathematical techniques 
such as discriminant function analysis (DFA), princi- 
pal components analysis (PCA), or classification and 
regression trees (CART), and these have been applied
to many mammalian and avian taxa (e.g. Derégnaucourt
et al., 2005; Dunlop et al., 2007; Garland et al.,
2012; Grieves, Logue & Quinn, 2014). Feature extrac- 
tion can be conducted using various levels of automa- 
tion. A human analyst may note specific features for 
each call, an analyst-guided algorithm can be employed 
(where sounds are identified by the analyst placing a 
bounding box around the call, followed by automatic 
extraction of a specific number of features), or the pro- 
cess of extraction can be fully automated. Automated 
techniques can be used to find regions of possible calls 
that are then verified and corrected by a human analyst 
(Helble et al., 2012). 


(c) Fully automatic metric extraction and classification 


However, current imple- 
mentations generally fall short of the performance 
desired (Janik, 1999), 







However, once an automatic algorithm is 
defined, large data sets can be analysed. Machine assis- 
tance can allow analysts to process much larger data sets 
than before, but at the risk of possibly missing calls that 
they might have been able to detect. 


(et al., 2000). Each of these methods provides analysis of
the spectral content of a short segment of the acous- 
tic production, and algorithms frequently examine how 
these parameters are distributed or change over time 
(e.g. Kogan & Margoliash, 1998). 


(d) Classification algorithms 
(Duda et al., 2012). In both cases, the biological relevance
of units must be verified independently because
mis-specification of units can obscure sequential pat- 
terns. Environmental noise or sounds from other 
species may be mistakenly classified as an acoustic unit, 
and genuine units may be assigned to incorrect unit cat- 
egories. When using supervised algorithms, perceptual 
bias may lead to misinterpreting data when the critical 
bands, temporal resolution, and hearing capabilities of 
a species are not taken into account. For instance, the 
exemplars themselves used in supervised clustering may 
be subject to similar subjective errors that can occur 
in visual classification. However, validation of unsupervised
clustering into units is also problematic, because
clustering results cannot be assessed against known unit 
categories. The interplay between unit identification 
and sequence model validation is a non-trivial problem 
(e.g. Jin & Kozhevnikov, 2011). Similarly, estimating
uncertainty in unit classification and assessing how that 
uncertainty affects conclusions from a sequence analysis 
is a key part of model assessment (Duda et al., 2012). 








For fully unsu- 
pervised clustering algorithms, where the desired clas- 
sification is unknown, techniques exist to quantify the 
stability of the clustering result, as an indicator of clus- 
tering quality. Examples include ‘leave-k-out’ (Manning, 
Raghavan & Schütze, 2008), a generalisation of the 
‘leave-one-out’ cross-validation, and techniques based 
on normalised mutual information (Zhong & Ghosh, 
2005), which measure the similarity between two clus- 
tering schemes (Fred & Jain, 2005). However, it must 
be clear that cluster stability (and correspondingly, 
inter-observer reliability) is not evidence that the classi- 
fication is appropriate (i.e. matches the true, unknown, 
biologically relevant categorisation), or will remain sta- 
ble upon addition of new data (Ben-David, Von Luxburg 
& Pál, 2006). Other information theoretic tests provide 
an alternative assessment of the validity of unsupervised 
clustering results, such as checking if units follow Zipf's 
law of abbreviation, which is predicted by a universal 
principle of compression (Zipf, 1949; Ferrer-i-Cancho 
et al., 2013) or Zipf's law for word frequencies, which 
is predicted by a compromise between maximising the
distinctiveness of units and the cost of producing them 
(Zipf, 1949; Ferrer-i-Cancho, 2005). 
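
The similarity between two clustering schemes can be sketched with normalised mutual information. The version below normalises by the geometric mean of the two label entropies, which is one of several conventions in use and is an assumption here. The function name and the example labels are illustrative only.

```python
import math
from collections import Counter


def normalised_mutual_information(labels_u, labels_v):
    """NMI between two labelings of the same items (1 = identical partitions)."""
    n = len(labels_u)
    if n == 0 or n != len(labels_v):
        raise ValueError("labelings must be non-empty and of equal length")
    count_u = Counter(labels_u)
    count_v = Counter(labels_v)
    count_uv = Counter(zip(labels_u, labels_v))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_u, h_v = entropy(count_u), entropy(count_v)
    mutual_information = sum(
        (c / n) * math.log(c * n / (count_u[u] * count_v[v]))
        for (u, v), c in count_uv.items()
    )
    if h_u == 0.0 or h_v == 0.0:
        # A single-cluster labeling carries no information.
        return 1.0 if h_u == h_v else 0.0
    return mutual_information / math.sqrt(h_u * h_v)


# Two hypothetical clustering runs over the same four sounds.
same_partition = normalised_mutual_information([0, 0, 1, 1], ["x", "x", "y", "y"])
unrelated = normalised_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Cluster names do not need to match between runs, because only the partition of the items matters. A stability check in the spirit of leave-k-out could repeat the clustering on subsets of the data and compare the labels of the shared items in this way.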


(5) Unit choice protocol 


this length. In particular, availability or otherwise of 
Fig. 5. Graphical representation of the process of selecting an appropriate unit definition. (A) Determine what is known
about the production mechanism of the signalling individual, from the hierarchy of production mechanisms, and their 
spectrotemporal differences. (B) Determine what is known about the perception abilities of the receiver (vertical axis), 
and how this may modify the production characteristics of the sound (horizontal axis). (C) Choose a classification method 
suitable for the modified acoustic characteristics ( indicates suitable, x indicates unsuitable, ~ indicates neutral). 


behavioural information, such as the responses of 
individuals to playback experiments, is often the deter- 
mining factor in deciding how to define a sequence 
unit. However, we provide here a brief protocol that can 
be used in conjunction with such prior information, or 
in its absence, to guide the researcher in choosing the 
definition of a unit. This protocol is also represented 
graphically in Fig. 5. (1) Determine what is known about
the production mechanism of the signalling individual. 
For example, Fig. 5A lists eight possible production. 

IV. INFORMATION-EMBEDDING PARADIGMS


A ‘sequence’ can be defined as an ordered list of units. 
Animals produce sequences of sounds through a wide 
range of mechanisms (e.g. vocalisation, stridulation, 
percussion), and different uses of the sound-producing 
apparatus can produce different sound ‘units’ with
distinct and distinguishable properties. The resulting 
order of these varied sound units may or may not con- 
tain information that can be interpreted by a receiver, 
irrespective of whether or not the signaller intended 
to convey meaning. Given that a sequence must con- 
sist of more than one ‘unit’ of one or more different 
types, the delineation and definition of the unit types is 
clearly of vital importance. We have discussed this ques- 
tion at length in Section III. However, assuming that 
units have been successfully assigned short-hand labels 
(e.g. A, B, C, etc.), what different methods can be used 
to arrange these units in a sequence, in such a way that 
the sequence can contain information? 

Although it seems intuitively obvious that a sequence 
of such labels may contain information, this intuition 
arises from our own natural human disposition to language
and writing, and may not be particularly useful in
identifying information in animal sequences. We appre- 
ciate that birdsong, for instance, can be described as a 
complex combination of notes, and we may be tempted 
to compare this animal vocalisation to human music 
(Baptista & Keister, 2005; Araya-Salas, 2012; Rothenberg 
et al., 2013). An anthropocentric approach, however, is 
not likely in all cases to identify structure relevant to ani- 
mal communication. Furthermore, wide variation can 
be expected between the structure of sequences gener- 
ated by different taxa, from the pulse-based stridulation 
of insects (Gerhardt & Huber, 2002) to song in whales 
(reviewed in Cholewiak et al., 2012), and a single analyt- 
ical paradigm derived from a narrow taxonomic view is 
also likely to be inadequate. A more rigorous analysis is 
needed, one that indicates the fundamental structural 
properties of acoustic sequences, in all their diversity. 
Looking for information only, say, in the order of units 
can lead researchers to miss information encoded in 
unit timing, or pulse rate. 

Although acoustic information can be encoded 
in many different ways, we consider here only the 
encoding of information via sequences. We suggest a 
classification scheme based on six distinct paradigms 
for encoding information in sequences (Fig. 6): (a)
Repetition, where a single unit is repeated more than 
once; (b) Diversity, where information is represented 
by the number of distinct units present; (c) Combina- 
tion, where sets of units have different information 
from each unit individually; (d) Ordering, where the 
relative position of units to each other is important; 
(e) Overlapping, where information is conveyed in the 
relationship between sequences of two or more individ- 
uals; and (f) Timing, where the time gap between units 
conveys information. This framework can form the 
basis of much research into sequences, and provides 
a useful and comprehensive approach for classifying 
information-bearing sequences. We recommend that in 
any research into animal acoustic communication with 
a sequential component, researchers first identify the 
place(s) of their focal system in this framework, and 
use this structure to guide the formulation of useful, 
testable hypotheses. Identification of the place for one's 
study system will stem in part from the nature of the 
system — a call system comprising a single, highly stereo- 
typed contact note will likely fit neatly into the Repetition 
and Timing schemes we discuss, but may have little or 
nothing to do with the other schemes. We believe that 
our proposed framework will go beyond this, however, 
to drive researchers to consider additional schemes for 
their systems of study. For example, birdsong playback 
studies have long revealed that Diversity and Repetition
often influence the behaviour of potential conspecific 
competitors and mates (Searcy & Nowicki, 2005). Much 
less is known about the possibility that Ordering, Over- 
lapping, or Timing affect songbird receiver behaviour, 
largely because researchers simply have yet to assess 
that possibility in most systems. Considering the formal 
structures of possible information-embedding systems 
may provide supportive insights into the cognitive and 
evolutionary processes taking place (Chatterjee, 2005; 
Seyfarth, Cheney & Bergman, 2005). Of course, any par- 
ticular system might have properties of more than one of 
the six paradigms in this framework, and the boundaries 
between them may not always be clearly distinguished. 
Sperm whale Physeter macrocephalus coda exchanges 
(Watkins & Schevill, 1977) provide an example of this. 
A coda is a sequence of clicks (Repetition of the acoustic 
unit) where the Timing between echolocation clicks 
moderates response. In duet behaviour, Overlap also 
exists, with one animal producing and another respond- 
ing with another coda (Schulz et al., 2008). Each of these 
paradigms is now described in more detail below. 


(1) Repetition 


Sequences are made of repetitions of discrete units, and 
repetitions of the same unit affect receiver responses. 
For instance, the information contained in a unit A 
given in isolation may convey a different meaning to a 
receiver than an iterated sequence of unit A (e.g. AAAA, 
etc.). For example, greater numbers of D notes in the 

Type | Criterion | Example
A Repetition | Single unit repeated more than once. | Chickadee D-note mobbing call (Baker & Becker, 2002)
B Diversity | A number of distinct units are present. Order is unimportant. | Birdsong repertoire size (Searcy, 1992)
C Combination | Set of units has different information from each unit individually. Order is unimportant. | Banded mongoose close calls (Jansen et al., 2012)
D Ordering | Set of units has different information from each unit individually. Order is important. | Human language, humpback song (Payne & McVay, 1971)
E Overlapping | Information conveyed in the relationship between sequences of two or more individuals. | Sperm whale codas (Schulz et al., 2008)
F Timing | Timing between units (often between different individuals) conveys information. | Group alarm calling (Thompson & Hare, 2010)

Fig. 6. (A–F) Different ways that units can be combined to encode information in a sequence.



chick-a-dee calls of chickadee species Poecile spp. can be 
related to the immediacy of threat posed by a detected 
predator (Krams et al., 2012). Repetition in alarm calls 
is related to situation urgency in meerkats Suricata suri- 
catta (Manser, 2001), marmots Marmota spp. (Blumstein, 
2007), colobus monkeys Colobus spp. (Schel, Candiotti 
& Zuberbühler, 2010), Campbell's monkeys Cercopithe- 
cus campbelli (Lemasson et al., 2010) and lemurs Lemur 
catta and Varecia variegata (Macedonia, 1990). 
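
As a minimal sketch of how Repetition might be quantified once units have been labelled, the code below collapses a labelled sequence into runs of identical units and reports the longest run of a chosen unit. The function names and the example call are hypothetical and are not drawn from the cited studies.

```python
from itertools import groupby


def run_lengths(sequence):
    """Collapse a sequence of unit labels into (label, run length) pairs."""
    return [(label, len(list(group))) for label, group in groupby(sequence)]


def longest_run(sequence, unit):
    """Length of the longest uninterrupted repetition of `unit` (0 if absent)."""
    return max((n for label, n in run_lengths(sequence) if label == unit), default=0)


# Hypothetical call with three introductory notes followed by four D notes.
call = list("ABCDDDD")
runs = run_lengths(call)
d_notes = longest_run(call, "D")
```

The run length of a unit can then be related to context, for example to the type or proximity of a predator, in the same way as any other call measurement.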


(2) Diversity 


Sequences of different units (e.g. A, B, C) are pro- 
duced, but those units are functionally interchangeable, 
and therefore ordering is unimportant. For instance, 
many songbirds produce songs with multiple different 
syllables. In many species, however, the particular sylla- 
bles are substitutable (e.g. Eens, Pinxten & Verheyen, 


1991; Farabaugh & Dooling, 1996; but see Lipkind 
et al., 2013), and receivers attend to the overall diver- 
sity of sounds in the songs or repertoires of signallers 
(Catchpole & Slater, 2003). Large acoustic repertoires 
have been proposed to be sexually selected in species 
such as great reed warblers Acrocephalus arundinaceus 
and common starlings Sturnus vulgaris (Eens, Pinxten 
& Verheyen, 1993; Hasselquist, Bensch & von Schantz, 
1996; Eens, 1997), in which case diversity embeds infor- 
mation (that carries meaning) on signaller quality (e.g. 
Kipper et al., 2006). Acoustic ‘diversity’ has additionally 
been proposed as a means of preventing habituation 
on the part of the receiver (Hartshorne, 1956, 1973; 
Kroodsma, 1990) as well as a means of avoiding (neuro- 
muscular) ‘exhaustion’ on the part of the sender (Lam- 
brechts & Dhondt, 1987, 1988). We do note that these 
explanations remain somewhat controversial, especially 
if the transitions between acoustic units are, indeed, 
biologically constrained (Weary & Lemon, 1988, 1990;
Weary et al., 1988; Weary, Lambrechts & Krebs, 1991; 
Riebel & Slater, 2003; Brumm & Slater, 2006). 


(3) Combination 


Sequences may consist of different discrete acoustic 
units (e.g. A, B, C) each of which is itself meaningful, 
and the combining of the different units conveys distinct 
information. Here, order does not matter (in contrast 
to the Ordering paradigm below) - the sequence of unit 
A followed by unit B has the same information as the 
sequence of unit B followed by unit A. For example, 
titi monkeys Callicebus nigrifrons (Cäsar et al., 2013) use
semantic alarm combinations, in which interspersing
avian predator alarm calls (A-type) with terrestrial
predator alarm calls (B-type) indicates the presence 
of a raptor on the ground. In this case, the number 
of calls (i.e. Repetition) also appears to influence the 
information present in each call sequence (Cäsar et al., 
2013). 


(4) Ordering 


Sequences consist of different discrete acoustic units (e.g. A,
B, C), each of which is itself meaningful and the specific
order of which is meaningful. Here, order matters -
and the ordered combination of discrete units
may result in emergent responses. For instance, A fol- 
lowed by B may elicit a different response than either 
A or B alone, or B followed by A. Examples include 
primate alarm calls which, when combined, elicit differ- 
ent responses related to the context of the predatory 
threat (Arnold & Zuberbühler, 2006b, 2008). Human
languages are a sophisticated example of ordered information
encoding (Hauser, Chomsky & Fitch, 2002).
When sequences have complex ordering, simple quan- 
titative measures are unlikely to capture the ordering 
information. Indeed, the Kolmogorov complexity of a 
sequence indicates how large a descriptor is required to 
specify the sequence adequately (Denker & Woyczynski, 
1998). Instead of quantifying individual sequences, an 
alternative approach to measuring ordering is to calcu- 
late the pairwise similarity or difference between two 
sequences, using techniques such as the Levenshtein or 
Edit distance (Garland et al., 2012; Kershenbaum et al., 
2012). 
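
The Levenshtein (edit) distance mentioned above can be sketched as follows. This is the standard dynamic-programming formulation with unit costs for insertion, deletion and substitution. The cited studies may have used different costs or normalisations, so the details here are assumptions.

```python
def levenshtein(seq_a, seq_b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn seq_a into seq_b (each edit costs 1)."""
    # previous[j] holds the distance between the processed prefix of seq_a
    # and the first j units of seq_b.
    previous = list(range(len(seq_b) + 1))
    for i, unit_a in enumerate(seq_a, start=1):
        current = [i]
        for j, unit_b in enumerate(seq_b, start=1):
            substitution = previous[j - 1] + (unit_a != unit_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


# Hypothetical labelled sequences.
same_units_reordered = levenshtein("AB", "BA")
one_unit_dropped = levenshtein("ABCD", "ABD")
```

Because the distance is sensitive to order, the sequences AB and BA are two edits apart even though they contain the same units. Under the Combination paradigm these two sequences would be treated as equivalent.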


(5) Overlapping 


Sequences are combined from two or more individuals 
into exchanges for which the order of these overlapping 
sequences has information distinct from each signaller’s 
signals in isolation. Overlapping can be in the time 
dimension (i.e. two signals emitted at the same time) 
or in acoustic space, e.g. song-type matching (Krebs, 
Ashcroft & Orsdol, 1981), and frequency matching 
(Mennill & Ratcliffe, 2004). For example, in different 
parid species (Paridae: chickadees, tits, and titmice),
females seem to attend to the degree to which their 
males’ songs are overlapped (in time) by neighbouring 
males’ songs, and seek extra-pair copulations when 
their mate is overlapped (Otter et al., 1999; Mennill,
Ratcliffe & Boag, 2002). Overlapping is also used for 
social bonding, spatial perception, and reunion, such 
as chorus howls in wolves (Harrington et al., 2003) and
sperm whale codas (Schulz et al., 2008). Overlapping 
as song-type matching (overlapping in acoustic space) 
is also an aggressive signal in some songbirds (Akcay 
et al., 2013), although this may depend on whether it is 
the sequence or the individual unit that is overlapped 
(Searcy & Beecher, 2011). Coordination between the 
calling of individuals can also give identity cues (Carter 
et al., 2008). However, despite the apparent widespread 
use of overlapping in sequences, few analytical models 
have been developed to address this mechanism. While 
this is a promising area for future research, it is currently 
beyond the purview of this review. 


(6) Timing 


The temporal spacing between units in a sequence can 
contain information. In the simplest case, pulse rate and 
interpulse interval can distinguish between different 
species, for example in insects and anurans (Gerhardt 
& Huber, 2002; Nityananda & Bee, 2011), rodents 
(Randall, 1997), and primates (Hauser, Agnetta & 
Perez, 1998). Call timing can indicate fitness and 
aggressive intent, e.g. male howler monkeys Alouatta 
pigra attend to howling delay as an indicator of 
aggressive escalation (Kitchen, 2004). Additionally, 
when sequences are produced by different individ- 
uals, a receiver may interpret the timing differences 
between the producing individuals to obtain contextual 
information. For instance, ground squirrels Spermophilus
richardsonii use the spatial pattern and temporal
sequence of conspecific alarm calls to provide informa- 
tion on a predator's movement trajectory (Thompson 
& Hare, 2010). This information only emerges from the 
sequence of different callers initiating calls (Blumstein, 
Verneyre & Daniel, 2004). Such risk tracking could 
also emerge from animals responding to sequences of 
heterospecific alarm signals produced over time. 
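
For the simplest Timing measures, the sketch below derives inter-unit intervals and a mean unit rate from a list of unit onset times. It assumes that intervals are measured from onset to onset; a study concerned with the silent gap between units would instead measure from the offset of one unit to the onset of the next. The function names and onset times are invented for illustration.

```python
def inter_unit_intervals(onset_times_s):
    """Onset-to-onset intervals (in seconds) between successive units."""
    onsets = sorted(onset_times_s)
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def mean_rate_hz(onset_times_s):
    """Mean number of units per second across the sequence."""
    intervals = inter_unit_intervals(onset_times_s)
    if not intervals:
        raise ValueError("at least two onsets are needed to estimate a rate")
    return len(intervals) / sum(intervals)


# Hypothetical pulse onsets, in seconds from the start of a recording.
onsets = [0.0, 0.5, 1.0, 1.5]
intervals = inter_unit_intervals(onsets)
rate = mean_rate_hz(onsets)
```

When the units come from several callers, the same calculation can be applied to onsets pooled across individuals, with caller identity kept alongside each onset.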


(7) Information-embedding paradigms: conclusions 


The use of multiple embedding techniques may be quite 
common, for instance in intrasexual competitive and 
intersexual reproductive contexts (Gerhardt & Huber, 
2002). For example, many frog species produce pulsatile 
advertisement calls consisting of the same repeated 
element. If it is the case that both number of pulses and
pulse rate affect receiver responses, as shown in some 
hylid treefrogs (Gerhardt, 2001), then information is 
being embedded using both the Repetition and the 
Timing paradigms simultaneously. 
Fig. 7. Flowchart suggesting possible paths for the analysis of sequences of acoustic units. Exploratory data analysis is
conducted on the units or timing information using techniques such as histograms, networks, or low-order Markov models.
Preliminary embedding paradigm hypotheses are formed based on observations. Depending upon the hypothesised 
embedding paradigm, various analysis techniques are suggested. HMM, hidden Markov model. 


Before hypothesising a specific structuring paradigm, 
it is frequently useful to perform exploratory data anal- 
ysis (Fig. 7). This might begin by looking at histograms, 
networks, or low-order Markov models that are based 
on acoustic units or timing between units. This analysis 
can be on the raw acoustic units or may involve pre- 
processing. An example of preprocessing that might be 
helpful for hypothesising Repetition would be to create 
histograms that count the number of times that acous- 
tic units occur within a contiguous sequence of vocal- 
isations. As an example, if 12 different acoustic units 
each occurred three times, a histogram bin represent- 
ing three times would have a value of 12; for examples, 
see Jurafsky & Martin (2000). For histograms or net- 
works, visual analysis can be used to determine if there 
are any patterns that bear further scrutiny. Metrics such 
as entropy can be used to provide an upper bound on 
how well a Markov chain model describes a set of vocal- 
isations (smaller numbers are better, as an entropy of 
zero indicates that we model the data perfectly). If noth- 
ing is apparent, it might mean that there is no structure 
to the acoustic sequences, but it is also possible that the
quantity of data is insufficient to reveal the structure
or that the structure is more complex than what can be 
revealed through casual exploratory data analysis. 
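
The preprocessing described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the 12-unit example sequence are assumptions made here, not part of any published analysis. It builds the histogram of occurrence counts suggested for Repetition, and tallies adjacent unit pairs as raw material for a low-order Markov model.

```python
from collections import Counter

def count_of_counts(sequence):
    """Histogram: k -> number of unit types that occur exactly k times."""
    unit_counts = Counter(sequence)          # unit type -> number of occurrences
    return Counter(unit_counts.values())     # occurrences -> number of unit types

def bigram_counts(sequence):
    """Counts of adjacent unit pairs (first-order transitions)."""
    return Counter(zip(sequence, sequence[1:]))

# Hypothetical sequence: 12 unit types (labelled a..l), each produced three times.
sequence = list("abcdefghijkl") * 3
hist = count_of_counts(sequence)
pairs = bigram_counts(sequence)

print(hist[3])             # 12: the bin for "three times" holds 12 unit types
print(pairs[("a", "b")])   # 3: 'a' is followed by 'b' three times
```

Plotting `hist` as a bar chart gives the histogram described in the text; `pairs` can be inspected directly or normalised into transition probabilities.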


Exploratory data analysis may lead to hypotheses 
that one or more of the embedding paradigms for 
acoustic sequences may be appropriate. At this point 
a greater effort should be put into the modelling and 
understanding and we provide a suggested flow of 
techniques (Fig. 7). It is important to keep in mind 
that these are only suggestions. For example, while 
we suggest that a grammar (Section V.4) be modelled 
if there is evident and easily described structure for 
Repetition, Diversity, and Ordering, other models could be 
used effectively and machine learning techniques for 
generating grammars may be able to do so when the 
structure is less evident. 

We conclude this section with a discussion of two 
examples of how sequences of acoustic signals pro- 
duced by signallers can influence meaning to receivers. 
These two examples come from primates and exem- 
plify the Diversity and Ordering types of sequences illus- 
trated in Fig. 6. The example of the Diversity type is 
the system of serial calls of titi monkeys, Callicebus mol- 
loch, used in a wide range of social interactions. Here, 
the calls comprise several distinct units, many of which 
are produced in sequences. Importantly, the units of 
this call system seem to have meaning primarily in the 
context of the sequence; this call system therefore
seems to represent the notion of phonological syntax
(Marler, 1977). One sequence has been tested via playback
studies: the ‘honks–bellows–pumps’ sequence is
used frequently by males that are isolated from and
not closely associated with females and may recruit
non-paired females (Robinson, 1979). Robinson (1979)
played back typical honks–bellows–pumps sequences
and atypical (i.e. reordered) honks–pumps–bellows
sequences and found little evidence that
groups of titi monkeys responded differently to the two 
playbacks (although they gave one call type — a ‘moan’, 
produced often during disturbances caused by other 
conspecific or heterospecific monkey groups — more 
often to the atypical sequences). 

The second example relates to the Ordering type of 
sequence (Fig. 6), and stems from two common calls 
of putty-nosed monkeys, Cercopithecus nictitans martini. 
‘Pyow’ calls can be produced individually or in strings 
of pyows, and seem to be used by putty-nosed monkeys 
frequently when leopards are detected in the environment
(Arnold & Zuberbühler, 2006a), and more generally
as an attention-getting signal related to recruitment
of receivers and low-level alarm (Arnold & Zuberbüh- 
ler, 2013). ‘Hack’ calls can also be produced individually 
or in strings of hacks, and seem to be used frequently 
when eagles are detected in the environment, and more 
generally as a higher-level alarm call (Arnold & Zuber- 
bühler, 2013). Importantly, pyow and hack calls are 
frequently combined into pyow–hack sequences. Both
naturalistic observational data and experimental
call playback results indicate that pyow–hack sequences
influence receiver behaviour differently than do pyow
or hack sequences alone: pyow–hack sequences seem
to mean ‘let’s go!’ and produce greater movement distances
in receivers (Arnold & Zuberbühler, 2006a).
The case of the pyow–hack sequence therefore seems
to represent something closer to the notion of lexical 
syntax — individual units and ordered combinations of 
those units have distinct meanings from one another 
(Marler, 1977). 

These two examples of primate calls illustrate the 
simple but important point that sequences matter in 
acoustic signals — combinations or different linear 
orderings of units (whether those units have meaning 
individually or not) can have different meanings to 
receivers. In the case of titi monkeys, the call sequences 
seem to serve the function of female attraction for 
male signallers, whereas in the case of putty-nosed 
monkeys, the call sequences serve anti-predatory and 
group-cohesion functions. 


V. ANALYSIS OF SEQUENCES 


Given that the researcher has successfully determined 
the units of an acoustic sequence that are appropriate 
for the hypothesis being tested, one must select 
and apply appropriate algorithms for analysing the
sequence of units. Many algorithms exist for the analysis
of sequences, both those produced by animals and
sequences in general (such as DNA and stock market
prices). Selection of an appropriate algorithm can
sometimes be guided by the quantity and variability of 
the data, but there is no clear rule to be followed. In 
fact, in machine learning, the so-called ‘no free lunch’ 
theorem (Wolpert & Macready, 1997) shows that there 
is no one pattern-recognition algorithm that is best for 
every situation, and any improvement in performance 
for one class of problems is offset by lower performance 
in another problem class. In choosing an algorithm for 
analyses, one should be guided by the variability and 
quantity of the data for analysis, keeping in mind that 
models with more parameters require more data to 
estimate the parameters effectively. 

We consider five models in this section: (i) Markov
chains, (ii) hidden Markov models, (iii) network models,
(iv) formal grammars, and (v) temporal models.
Each of these models has been growing in popularity 
among researchers, with the number of publications 
increasing in recent years. The number of publications 
in 2013 mentioning both the term ‘animal communication’
and the model name has grown since 2005
by a factor of: ‘Markov’, 4.9; ‘hidden Markov’, 3.3; ‘network’,
2.6; ‘grammar’, 1.7; ‘timing’, 2.3.

The structure-analysis algorithms discussed through- 
out this section can be used to model the different 
methods for combining units discussed earlier (Fig. 6). 
Repetition, Diversity, and Ordering are reasonably well 
captured by models such as Markov chains, hidden 
Markov models, and grammars. Networks capture struc- 
ture either with or without order, although much of the 
application of networks has been done on unordered 
associations (Combination). Temporal information can 
be modelled as an attribute of an acoustic unit requir- 
ing extensions to the techniques discussed below, or as 
a separate process. Table 2 summarises the assumptions 
and requirements for each of these models. 

Here we give a sample of some of the more important
and more promising algorithms for animal acoustic
sequence analysis, and discuss ways for selecting and
evaluating analytical techniques. Selecting appropriate
algorithms should involve the following steps. (i) Technique:
understand the nature of the models and their
mathematical basis. (ii) Suitability: assess the suitability
of the models and their constraints with respect
to the research questions being asked. (iii) Application:
apply the models to the empirical data (training,
parameter estimation). (iv) Assessment: extract metrics
from the models that summarise the nature of
the sequences analysed. (v) Inference: compare metrics
between data sets (or between empirical data and
random null-models) to draw ecological, mechanistic,
evolutionary, and behavioural inferences. (vi) Validation:
determine the goodness of fit of the model to the data

Table 2. A summary of the assumptions and requirements for each of the five different structure analysis models suggested in the review.

Model type            Embedding type  Data requirements           Typical hypotheses              Assumptions
--------------------  --------------  --------------------------  ------------------------------  ------------------------------
Markov chain          Repetition,     Number of observations      Independence of sequence;       Stationary transition matrix;
                      Diversity,      required increases greatly  Sequential structure            Sufficient data for maximum
                      Ordering        as the size of the model                                    likelihood estimator of
                                      grows                                                       transition matrix
Hidden Markov model   Repetition,     Number of observations      Non-stationary transitions of   Sufficient data to estimate
                      Diversity,      required increases greatly  observable states;              hidden states
                      Ordering        as the size of the model    Long-range correlations;
                                      grows                       Existence of cognitive states
Network               Combination,    Many unit types             Network metrics have            The properties of relations
                      Ordering                                    biological meaning;             between units are meaningful
                                                                  Comparison of motifs
Formal grammar        Repetition,     Few requirements            Linguistic hypotheses;          Deterministic transition rules
                      Diversity,                                  Deterministic sequences;
                      Ordering                                    Place in Chomsky hierarchy
Temporal structure    Overlapping,    Timing information exists;  Production/perception           Temporal variations are
                      Timing          No need to define units     mechanisms;                     perceived by the receiver
                                                                  Changes with time/effect

and uncertainty of parameter estimates. Bootstrapping 
techniques can allow validation with sets that were not 
used in model development. 


(1) Markov chains 


Markov chains, or N-gram models, capture structure
in acoustic unit sequences based on the recent history 
of a finite number of discrete unit types. Thus, the 
occurrence of a unit (or the probability of occurrence 
of a unit) is determined by a finite number of previous 
units. The history length is referred to as the order, 
and the simplest such model is a zeroth order Markov 
model, which assumes that each unit is independent 
of another, and simply determines the probability of 
observing any unit with no prior knowledge. A first 
order Markov model is one in which the probability of 
each unit occurring is determined only by the preceding 
unit, together with the ‘transition probability’ from one 
unit to the next. This transition probability is assumed to 
be constant (stationary). Higher order Markov models 
condition the unit probabilities based on more than 
one preceding unit, as determined by the model order.


An N-gram model conditions the probability on the
N − 1 previous units, and is equivalent to an (N − 1)th
order Markov model. A Kth order Markov model of a
sequence with C distinct units is defined by at most a
C^K × C matrix of transition probabilities from each of
the C^K possible preceding sequences, to each of the
C possible subsequent units, or equivalently by a state
transition diagram (Fig. 8).

As the order of the model increases, more and more 
data are required for the accurate estimation of tran- 
sition probabilities, i.e. sequences must be longer, and 
many transitions will have zero counts. This is partic- 
ularly problematic when looking at new data, which 
may contain sequences that were not previously encoun- 
tered, as they will appear to have zero probability. As 
a result, Markov models with orders greater than 2
(trigram, N = 3) are rare. In principle, a Kth order
Markov model requires sufficient data to provide accurate
estimates of C^(K+1) transition probabilities. In many
cases, the number of possible transitions is similar to,
or larger than, the entire set of empirical data. For
example, Briefer et al. (2010) examined very extensive
skylark Alauda arvensis sequences totalling 16829

Fig. 8. State transition diagram equivalent to a second
order Markov model and trigram model (N = 3) for a
sequence containing As and Bs.


units, but identified over 340 unit types. As a naive 
transition matrix between all unit types would contain 
340 x 340 = 115600 cells, the collected data set would 
be too small to estimate the entire matrix. A different 
problem occurs when, as is commonly the case, animal 
acoustic sequences are short. Kershenbaum et al. (2012) 
examined rock hyrax Procavia capensis sequences that 
are composed of just five unit types. However, 81% of
the recorded sequences were only five or fewer units long.
For these short sequences, 5^5 = 3125 different combinations
are possible, which is greater than the number
of such sequences recorded (2374). In these cases,
estimates of model parameters, and conclusions drawn 
from them, may be quite inaccurate (Cover & Thomas, 
1991; Hausser & Strimmer, 2009; Kershenbaum, 2013). 
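
The arithmetic behind these two examples can be checked with a short Python sketch. The helper names are hypothetical; the numbers (340 unit types and 16829 units for the skylark, five unit types and 2374 short sequences for the hyrax) are those quoted in the text, with 340 treated as a lower bound on the number of unit types.

```python
def n_transition_cells(n_unit_types, order):
    """Cells in an order-K transition table: C**K contexts times C next units."""
    return n_unit_types ** order * n_unit_types

def n_possible_sequences(n_unit_types, length):
    """Number of distinct unit sequences of a given length."""
    return n_unit_types ** length

# Skylark example: first-order table versus the number of units recorded.
skylark_cells = n_transition_cells(340, 1)
skylark_units = 16829
print(skylark_cells, skylark_cells / skylark_units)   # 115600 cells, roughly 6.9 per unit

# Hyrax example: possible five-unit sequences versus sequences recorded.
hyrax_possible = n_possible_sequences(5, 5)
hyrax_recorded = 2374
print(hyrax_possible, hyrax_possible > hyrax_recorded)   # 3125 True
```

Running the same calculation for a second-order model of the skylark data (`n_transition_cells(340, 2)`) shows how quickly the table outgrows any realistic data set.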

Closed-form expressions for maximum-likelihood
estimates of the transition probabilities can be used
with conditional counts (Anderson & Goodman, 1957).
For example, assuming five acoustic units (A–E),
maximum-likelihood estimates of the transition probabilities
for a first-order Markov model (bigram, N = 2)
can be found directly from the number of occurrences
of each transition, e.g.


\[ P(B \mid A) = \frac{\mathrm{count}(A, B)}{\sum_{i \in \{A,B,C,D,E\}} \mathrm{count}(A, i)} \tag{1} \]
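
As an illustration of equation (1), the following Python sketch estimates first-order transition probabilities from raw counts. The function name and the three example sequences are hypothetical; only adjacent pairs within a sequence are counted, so transitions are not assumed across sequence boundaries.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Maximum-likelihood first-order transition probabilities, as in equation (1)."""
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))    # count(A, B) for adjacent units
    row_totals = Counter()
    for (prev, _), n in pair_counts.items():
        row_totals[prev] += n                    # sum over i of count(prev, i)
    probs = defaultdict(dict)
    for (prev, nxt), n in pair_counts.items():
        probs[prev][nxt] = n / row_totals[prev]
    return dict(probs)

# Hypothetical recordings using the five units A-E.
sequences = ["ABABC", "ABCDE", "AABE"]
P = transition_probabilities(sequences)
print(P["A"]["B"])   # 0.8: A is followed by B in 4 of its 5 observed transitions
```

Transitions that never occur are simply absent from `P`, which is the zero-count problem discussed in the text.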


Although not widely used in the animal communica- 
tion literature, research in human natural language pro- 
cessing has led to the development of methods known 
as back-off models (Katz, 1987), which account for 
the underestimated probability of rare sequences using 
Good-Turing counts, a method for improving esti- 
mated counts for events that occur infrequently (Gale 
& Sampson, 1995). When a particular state transition 
is never observed in empirical data, the back-off model 
offers the minimum probability for this state transition 
so as not to rule it out automatically during the testing. 
Standard freely available tools, such as the SRI language
modelling toolkit (Stolcke, 2002), implement back-off 
models and can reduce the effort of adopting these 
more advanced techniques. 
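
The effect of reserving probability for unseen transitions can be illustrated with a much simpler technique than Katz back-off. The sketch below uses additive (Laplace) smoothing, which is not the back-off method described above and is shown only because it fits in a few lines; the function name, the pseudo-count `alpha`, and the counts are all hypothetical.

```python
def smoothed_transition_probabilities(pair_counts, unit_types, alpha=1.0):
    """Additive smoothing: each transition receives a pseudo-count alpha,
    so no transition is assigned zero probability."""
    probs = {}
    for prev in unit_types:
        total = sum(pair_counts.get((prev, nxt), 0) for nxt in unit_types)
        denom = total + alpha * len(unit_types)
        probs[prev] = {
            nxt: (pair_counts.get((prev, nxt), 0) + alpha) / denom
            for nxt in unit_types
        }
    return probs

# Hypothetical transition counts over three unit types; A -> C was never observed.
counts = {("A", "B"): 4, ("A", "A"): 1, ("B", "A"): 1, ("B", "C"): 2}
smoothed = smoothed_transition_probabilities(counts, "ABC")
print(smoothed["A"]["C"])   # 0.125 rather than 0
```

Additive smoothing tends to over-correct when there are many unit types, which is one reason the discounting-based methods cited above are generally preferred in language modelling.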

Once Markovian transitions have been calculated and 
validated, the transition probabilities can be used to cal- 
culate a number of summary metrics using information 
theory (Shannon et al., 1949; Chatfield & Lemon, 1970; 
Hailman, 2008). For a review of the mathematics underlying
information theory, we direct readers to the
overview in McCowan, Hanser & Doyle (1999) or Freeberg
& Lucas (2012), each of which provides the equations as
well as a comprehensive reference list to other previous
work. Here we will define these quantitative measures
with respect to their relevance in analysing animal
acoustic sequences. Zero-order entropy measures repertoire
diversity:

\[ H_0 = \log_2 (C) \tag{2} \]


where C = |V| is the cardinality of the set of acoustic
units V. First-order entropy H_1 begins to measure
simple repertoire internal organisational structure by
evaluating the relative frequency of use of different
signal types in the repertoire:


\[ H_1 = \sum_{v_i \in V} -P(v_i) \log_2 P(v_i) \tag{3} \]


where the probability P(v_i) of each acoustic unit v_i is typically
estimated based on frequencies of occurrence, as
described earlier with N-grams. Higher-order entropies 
measure internal organisational structure, and thus 
one form of communication complexity, by examining 
how signals interact within a repertoire at the two-unit 
sequence level, the three-unit sequence level, and so 
forth. 
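
The entropies defined above can be computed directly from unit frequencies, as in the following Python sketch. The function names are hypothetical. The two-unit measure is implemented as the conditional entropy of a unit given its predecessor, which is one common convention and an assumption here rather than a definition taken from the text.

```python
import math
from collections import Counter

def zero_order_entropy(sequence):
    """H_0 = log2(C), where C is the number of distinct unit types."""
    return math.log2(len(set(sequence)))

def first_order_entropy(sequence):
    """H_1 = sum over unit types of -P(v) log2 P(v)."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def second_order_entropy(sequence):
    """Conditional entropy of a unit given the preceding unit (assumed convention)."""
    pairs = Counter(zip(sequence, sequence[1:]))
    prev_totals = Counter()
    for (prev, _), c in pairs.items():
        prev_totals[prev] += c
    n_pairs = sum(pairs.values())
    return -sum((c / n_pairs) * math.log2(c / prev_totals[prev])
                for (prev, _), c in pairs.items())

# Hypothetical, perfectly ordered sequence: each unit fully determines the next.
song = "ABCABCABCABC"
h0 = zero_order_entropy(song)
h1 = first_order_entropy(song)
h2 = second_order_entropy(song)
print(h0, h1, h2)   # h0 and h1 equal log2(3); h2 is zero
```

In this example the entropy drops from about 1.58 bits at first order to zero at the two-unit level, the signature of strong sequential organisation.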

One inferential approach is to calculate the entropic 
values from first-order and higher-order Markov models 
to summarise the extent to which sequential structure is 
present at each order. A random sequence would show 
no dependence of entropy on Markov order, whereas 
decreases in entropy as the order is increased would 
be an indication of sequential organisation, and thus 
higher communication complexity (Ferrer-i-Cancho &
McCowan, 2012). These summary measures can then be 
further extended to compare the importance of sequen- 
tial structure across different taxa, social and ecological 
contexts. These types of comparisons can provide novel 
insights into the ecological, environmental, social, and 
contextual properties that shape the structure, organ- 
isation, and function of signal repertoires (McCowan, 
Doyle & Hanser, 2002). 

The most common application of the Markov model 
is to test whether or not units occur independently in 
a sequence. Model validation techniques include the 
sequential and χ² tests (Anderson & Goodman, 1957).
For instance, Narins et al. (2000) used a permutation

Fig. 9. State transition diagram of a two-state (X, Y)
hidden Markov model capable of producing sequences
of acoustic units A and B. When in state X, emission
of acoustic units A and B is equally likely,
P(A|X) = P(B|X) = 0.5, and when in state Y, acoustic unit
A is much more likely, P(A|Y) = 0.9, than B, P(B|Y) = 0.1.
Transitioning from state X to state Y occurs with probability
P(X → Y) = 0.6, whereas from state Y to state X with
probability P(Y → X) = 0.3. [Figure label: P(Y → Y) = 0.7.]


test (Adams & Anthony, 1996) to evaluate the hypothesis
that a frog with an exceptionally large vocal repertoire,
Boophis madagascariensis, emitted any call pairs more
often than would be expected by chance. Similar tech- 
niques were used to show non-random call production 
by Sayigh et al. (2012) with short-finned pilot whales 
Globicephala macrorhynchus, and by Bohn et al. (2009) 
with free-tailed bats Tadarida brasiliensis. However,
deviation from statistical independence does not in itself
prove a sequence to have been generated by a Markov 
chain. Other tests, such as N-gram distribution (Jin & 
Kozhevnikov, 2011) may be more revealing. 
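
A generic permutation test of whether a particular call pair occurs more often than expected by chance can be sketched as follows. This is not the specific procedure of any study cited above: the function names, the number of permutations, and the example sequence are assumptions, and the null model simply shuffles unit order while preserving unit frequencies.

```python
import random

def pair_count(sequence, pair):
    """Number of times the ordered pair occurs as adjacent units."""
    return sum(1 for p in zip(sequence, sequence[1:]) if p == pair)

def permutation_p_value(sequence, pair, n_permutations=2000, seed=0):
    """One-sided p-value: how often a shuffled sequence contains the pair
    at least as often as the observed sequence does."""
    rng = random.Random(seed)
    observed = pair_count(sequence, pair)
    shuffled = list(sequence)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)                     # destroys order, keeps frequencies
        if pair_count(shuffled, pair) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical sequence in which A is always followed by B.
sequence = list("AB" * 20 + "C" * 10)
p_value = permutation_p_value(sequence, ("A", "B"))
print(p_value)
```

As noted in the text, a small p-value indicates departure from independence only; it does not show that the sequence was generated by a Markov chain.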


(2) Hidden Markov models 


Hidden Markov models (HMMs) are a generalisation 
of the Markov model. In Markov models, the acoustic 
unit history (of length N) can be considered the current 
‘state’ of the system. In HMMs (Rabiner, 1989), states
are not necessarily associated with acoustic units, but 
instead represent the state of some possibly unknown 
and unobservable process. Thus, the system progresses 
from one state to another, where the nature of each state 
is unknown to the observer. Each of these states may 
generate a ‘signal’ (i.e. a unit), but there is not necessar- 
ily a one-to-one mapping between state transitions and 
signals generated. For example, transitioning to state 
X might generate unit A, but the same might be true 
of transitioning to state Y. An observation is generated 
at each state according to a state-dependent probability 
density function, and state transitions are governed by 
a separate probability distribution (Fig. 9). HMMs are 
particularly useful to model very complex systems, while 
still being computationally tractable. 
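
The two-state model of Fig. 9 is small enough to write out in full. The Python sketch below uses the emission and transition probabilities given in the figure caption, with the self-transition probabilities taken as the complements (0.4 for X, 0.7 for Y). The uniform starting distribution is an assumption, since the figure does not specify one, and the function names are hypothetical.

```python
import itertools
import random

STATES = ("X", "Y")
TRANSITION = {"X": {"X": 0.4, "Y": 0.6}, "Y": {"X": 0.3, "Y": 0.7}}
EMISSION = {"X": {"A": 0.5, "B": 0.5}, "Y": {"A": 0.9, "B": 0.1}}
INITIAL = {"X": 0.5, "Y": 0.5}   # assumption: not specified in Fig. 9

def sample_sequence(length, seed=None):
    """Generate a unit sequence by walking the hidden states and emitting units."""
    rng = random.Random(seed)
    state = rng.choices(STATES, weights=[INITIAL[s] for s in STATES])[0]
    units = []
    for _ in range(length):
        symbols = list(EMISSION[state])
        units.append(rng.choices(symbols, weights=[EMISSION[state][u] for u in symbols])[0])
        targets = list(TRANSITION[state])
        state = rng.choices(targets, weights=[TRANSITION[state][s] for s in targets])[0]
    return "".join(units)

def sequence_likelihood(units):
    """Forward algorithm: probability of the units, summed over all hidden state paths."""
    alpha = {s: INITIAL[s] * EMISSION[s][units[0]] for s in STATES}
    for u in units[1:]:
        alpha = {s: sum(alpha[r] * TRANSITION[r][s] for r in STATES) * EMISSION[s][u]
                 for s in STATES}
    return sum(alpha.values())

example = sample_sequence(10, seed=1)
# Sanity check: likelihoods of all length-3 sequences should sum to one.
total = sum(sequence_likelihood("".join(s)) for s in itertools.product("AB", repeat=3))
print(example, sequence_likelihood("A"), total)
```

Because unit A can be emitted from either state, observing A does not reveal which state produced it, which is the sense in which the states are hidden.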

Extensions to the HMM model also exist, in which 
the state transition probabilities are non-stationary. For 
example, the probability of remaining in the same state 
may decay with time, e.g. due to neural depletion, as
shown by Jin & Kozhevnikov (2011), or recurrent units 
may appear more often than expected by a Markov 
model, particularly where behavioural sequences are 
non-Markovian (Cane, 1959; Kershenbaum, 2013; 
Kershenbaum et al., 2014). Also, HMMs are popular in 
speech analysis (Rabiner, 1989), where emissions are 
continuous-valued, rather than discrete. 

HMMs have been used fairly extensively in speaker 
recognition (Lee & Hon, 1989), the identification of 
acoustic units in birdsong (Trawicki, Johnson & Osiejuk, 
2005), and other analyses of birdsong sequences. ten 
Cate, Lachlan & Zuidema (2013) reviewed analytical 
methods for inferring the structure of birdsong and 
highlighted the idea that HMM states can be thought 
of as possibly modelling an element of an animal’s cog- 
nitive state. This makes it possible to build models that 
have multiple state distributions for the same acoustic 
unit sequence. For instance, in the trigram AAC, the 
probability given by the second order Markov model, 
P(C|A, A) is fixed. There cannot be different distri- 
butions for observing the unit C, if the previous two 
units are A. Yet cognitive state may have the poten- 
tial to influence the probability of observing C, even 
for identical sequence contexts (AA). Another state 
variable (θ) exists unobserved, as it reflects cognitive
state, rather than sequence history. In this case, P(C|A,
A, θ = 0) ≠ P(C|A, A, θ = 1). Hahnloser, Kozhevnikov &
Fee (2002), Katahira et al. (2011), and Jin (2009) have 
used HMMs to model the interaction between song and 
neural substrates in the brain. A more recent example 
of this can be seen in the work of Jin & Kozhevnikov 
(2011), where they used states to model neural units in 
song production of the Bengalese finch Lonchura striata 
var. domestica, restricting each state to the emission of
a single acoustic unit, thus making acoustic units asso- 
ciated with each state deterministic while retaining the 
stochastic nature of state transitions. 

Because the states of a HMM represent an unobserv- 
able process, it is difficult to estimate the number of 
states needed to describe the empirical data adequately. 
Model selection methods and criteria (for example 
Akaike and Bayesian information criteria, and others) 
can be used to estimate model order (see Hamaker,
Ganapathiraju & Picone (1998) and Zucchini &
MacDonald (2009) for a brief review), so the number
of states is often determined empirically. Increasing
the number of states permits the modelling of more 
complex underlying sequences (e.g. longer term depen- 
dencies), but increases the amount of data required 
for proper estimation. The efficiency and accuracy of 
model fitting depends on model complexity, so that 
models with many states, many acoustic units, and 
perhaps many covariates or other conditions will take 
more time and require more data to fit. 
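
The trade-off between fit and complexity can be made concrete with the Akaike and Bayesian information criteria. In the Python sketch below, the parameter count is the standard one for an HMM with discrete emissions; the log-likelihood values, the number of unit types, and the number of observations are invented purely for illustration and do not come from any data set.

```python
import math

def hmm_free_parameters(n_states, n_unit_types):
    """Free parameters of a discrete-emission HMM: each probability row sums
    to one, so each row contributes one fewer parameter than its length."""
    transitions = n_states * (n_states - 1)
    emissions = n_states * (n_unit_types - 1)
    initial = n_states - 1
    return transitions + emissions + initial

def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_observations):
    return n_params * math.log(n_observations) - 2 * log_likelihood

# Hypothetical fitted log-likelihoods for models with 2, 3 and 4 hidden states.
log_likelihoods = {2: -410.2, 3: -401.5, 4: -399.8}
n_unit_types, n_observations = 5, 600

scores = {
    s: bic(ll, hmm_free_parameters(s, n_unit_types), n_observations)
    for s, ll in log_likelihoods.items()
}
best_by_bic = min(scores, key=scores.get)
print(scores, best_by_bic)
```

With these invented numbers the small gain in likelihood from extra states does not offset the penalty for the additional parameters, so the two-state model is preferred.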

During training, HMM parameters are estimated 
using an optimisation algorithm (Cappé, Moulines & 
Fig. 10. Simple networks constructed from the sequence of acoustic units A, B and C. The undirected binary network
(A) simply indicates that A, B, and C are associated with one another without any information about transition direction. 
The directed binary network (B) adds ordering information, for example that C cannot follow A. The weighted directed 
network (C) shows the probabilities of the transitions between units based on a bigram model. 


Rydén, 2005) that finds a combination of hidden states, 
state transition tables, and state-dependent distributions 
that best describe the data. Software libraries for the 
training of HMMs are available in many formats, e.g. 
the Matlab function hmmtrain, the R package HMM 
(R Development Team, 2012), and the Hidden Markov 
Model Toolkit (Young & Young, 1994). Considerations
of data set completeness similar to those for
regular Markov models apply; most importantly,
long sequences of data are required.

Although the states of a HMM are sometimes pos- 
tulated to possess biologically relevant meaning, the 
internal states of the HMM represent a hidden process, 
and do not necessarily refer to concrete behavioural 
states. Specifically, the training algorithm does not con- 
tain an optimisation criterion that will necessarily asso- 
ciate model states with the functional or ecological 
states of the animal that a researcher is interested in 
observing (e.g. foraging, seeking a mate, etc.). While 
the functional/ecological state is likely related to the 
sequence, each model state may in fact represent a dif- 
ferent subsequence of the data. Therefore, one cannot 
assume in general that there will be a one-to-one map- 
ping between model and animal states. Specific hidden 
Markov models derived from different empirical data 
are often widely different, and it can be misleading to 
make comparisons between HMMs derived from differ- 
ent data sets. Furthermore, obtaining consistent states 
requires many examples with respect to the diversity of 
the sequence being modelled. An over-trained model
will be highly dependent on the data presented to it and 
small changes in the training data can result in very dif- 
ferent model parameters, making state-based inference 
questionable. 


(3) Network models 


The structure of an acoustic sequence can also be 
described using a network approach (reviewed in
Newman (2003) and Baronchelli et al. (2013)), as has
been done for other behavioural sequences, e.g. pollen
processing by honeybees (Fewell, 2003). A node in the 
network represents a type of unit, and a directional 
edge connecting two nodes means that one unit comes 


after the other in the acoustic sequence. For example, 
if a bird sings a song in the order ABCABC, the network
representing this song will have three nodes for A, B, 
and C, and three edges connecting A to B, B to C, and C 
to A (Fig. 10). The edges may simply indicate association 
between units without order (undirected binary net- 
work), an ordered sequence (directed binary network), 
or a probability of an ordered sequence (directed 
weighted network), the latter being equivalent to a 
Markov chain (Newman, 2009). 
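
The three network types can be built from a unit sequence with a few lines of Python. This is a minimal sketch using plain dictionaries and sets rather than a network library; the function names are hypothetical, and the example is the ABCABC song from the text.

```python
from collections import Counter

def directed_binary_network(sequence):
    """Set of ordered edges (u, v): unit v directly follows unit u at least once."""
    return set(zip(sequence, sequence[1:]))

def undirected_binary_network(sequence):
    """Set of unordered associations between adjacent, distinct units."""
    return {frozenset(pair) for pair in zip(sequence, sequence[1:]) if pair[0] != pair[1]}

def directed_weighted_network(sequence):
    """Edge weights are bigram transition probabilities, as in a Markov chain."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    out_totals = Counter()
    for (src, _), n in pair_counts.items():
        out_totals[src] += n
    return {(src, dst): n / out_totals[src] for (src, dst), n in pair_counts.items()}

song = "ABCABC"
edges = directed_binary_network(song)
weights = directed_weighted_network(song)
print(sorted(edges))   # [('A', 'B'), ('B', 'C'), ('C', 'A')]
print(weights)         # each of the three edges has weight 1.0
```

The binary versions remain usable when there are too many unit types to estimate the weights reliably.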

The network representation is fundamentally similar 
to the Markov model, and the basic input for construct- 
ing a binary network is a matrix of unit pairs within the 
repertoire, which corresponds to the transition matrix 
in a Markov model. However, the network represen- 
tation may be more widely applicable than a Markov 
analysis, particularly when a large number of distinct 
unit types exist, precluding accurate estimation of tran- 
sition probabilities (e.g. Sasahara et al., 2012; Deslandes 
et al., 2014; Weiss et al., 2014). In this case, binary or 
simple directed networks may capture pertinent prop- 
erties of the sequence, even if transition probabilities 
are unknown. 

One of the attractive features of network analysis is 
that a large number of quantitative network measures 
exist for comparison to other networks (e.g. from dif- 
ferent individuals, populations, or species), or for test- 
ing hypotheses. We list a few of the popular algorithms 
that can be used to infer the structure of the acous- 
tic sequence using a network approach. We refer the 
reader to introductory texts to network analysis for fur- 
ther details (Newman, 2009; Scott & Carrington, 2011). 

Degree centrality measures the number of edges directly 
connected to each node. In a directed network, each 
node has an in-degree and an out-degree, correspond- 
ing to incoming and outgoing edges. The weighted 
version of degree centrality is termed strength central- 
ity, which takes into account the weights of each edge 
(Barrat et al., 2004). Degree/strength centrality identi- 
fies the central nodes in the network, corresponding to 
central elements in the acoustic sequence. For example, 
in the mockingbird Mimus polyglottos, which imitates 
sounds of other species, its own song is central in 
the network, meaning that the bird usually separates
other sounds by singing its own song (Gammon &
Altizer, 2011). 

Betweenness centrality is a measure of the role a central 
node plays in connecting other nodes. For example, 
if an animal usually uses three units before moving 
to another group of units, a unit that lies between 
these groups in the acoustic sequence will have high 
betweenness centrality. A weighted version of between- 
ness centrality was described in Opsahl, Agneessens & 
Skvoretz (2010). 

Clustering coefficient describes how many triads of nodes 
are closed in the network. For example, if unit A is 
connected to B, and B is connected to C, a cluster 
is formed if A is also connected to C. Directed and 
weighted versions of the clustering coefficient have also 
been described (Barrat et al., 2004; Fagiolo, 2007). 

Mean path length is defined as the average minimum 
number of connections to be crossed from any arbitrary 
node to any other. This measures the overall navigability 
in the network; as this value becomes large, a longer 
series of steps is required for any node to reach another. 
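
Two of these measures, degree centrality and mean path length, can be computed without a network library, as in the Python sketch below. The function names are hypothetical, and mean path length is averaged over ordered pairs of nodes that are connected by a directed path, which is one of several possible conventions for directed networks.

```python
from collections import deque

def degrees(edges):
    """In-degree and out-degree of every node in a directed binary network."""
    in_deg, out_deg = {}, {}
    for src, dst in edges:
        for node in (src, dst):
            in_deg.setdefault(node, 0)
            out_deg.setdefault(node, 0)
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg

def mean_path_length(edges):
    """Average shortest directed path length over connected ordered node pairs."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set())
    total, n_pairs = 0, 0
    for start in neighbours:
        dist = {start: 0}
        queue = deque([start])
        while queue:                              # breadth-first search from start
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(d for node, d in dist.items() if node != start)
        n_pairs += len(dist) - 1
    return total / n_pairs if n_pairs else float("nan")

# Directed network of the song ABCABC: A -> B -> C -> A.
edges = {("A", "B"), ("B", "C"), ("C", "A")}
in_deg, out_deg = degrees(edges)
mpl = mean_path_length(edges)
print(in_deg, out_deg, mpl)   # every degree is 1; mean path length is 1.5
```

For real data sets the packages listed at the end of this section provide tested implementations of these and the other measures.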

Small-world metric measures the level of connected- 
ness of a network and is the ratio of the clustering 
coefficient C to the mean path length L after nor- 
malising each with respect to the clustering coeffi- 
cient and mean path length of a random network: 
S = (C/C_rand) / (L/L_rand). If S > 1 the network is regarded
as ‘small-world’ (Watts & Strogatz, 1998; Humphries 
& Gurney, 2008), with the implication that nodes are 
reasonably well connected and that it does not take a 
large number of edges to connect most pairs of nodes. 
Sasahara et al. (2012) demonstrated that the network 
of California thrasher Toxostoma redivivum songs has 
a small-world structure, in which subsets of phrases 
are highly grouped and linked with a short mean 
path length. 

Network motifs are recurring structures that serve as 
building blocks of the network (Milo et al., 2002). For 
example, a network may feature an overrepresentation 
of specific types of triads, tetrads, or feed-forward loops. 
Network motif analysis could be informative in com- 
paring sequence networks from different individuals, 
populations or species. We refer the reader to three soft- 
ware packages available for motif analysis: FANMOD 
(Wernicke & Rasche, 2006); MAVisto (Schreiber & 
Schwöbbermeyer, 2005); and MFinder (Kashtan et al.,
2002). 

Community detection algorithms offer a method to
detect network substructure objectively (Fortunato, 
2010). These algorithms identify groups of nodes with 
dense connections between them but that are sparsely 
connected to other groups/nodes. Subgroups of nodes 
in a network can be considered somewhat independent 
components of it, offering insight into the different 
subunits of acoustic sequences. Multi-scale commu- 
nity detection algorithms can be useful for detecting 
hierarchical sequence structures (Fushing & McAssey,
2010; Chen & Fushing, 2012). 

Exponential family Random Graph Models (ERGMs) offer 
a robust analytic approach to evaluate the contribution 
of multiple factors to the network structure using sta- 
tistical modelling (Snijders, 2002). These factors may 
include structural factors (e.g. the tendency to have 
closed triads in the network), and factors based on 
node or edge attributes (e.g. a tendency for connections 
between nodes that are acoustically similar). The goal 
of ERGMs is to predict the joint probability that a set of 
edges exists on nodes in a network. The R programming 
language package statnet has tools for model estimation
and evaluation, and for model-based network simula- 
tion and network visualisation (Handcock et al., 2008). 

As with other models, many statistical tests for infer- 
ence and model assessment require a comparison of 
the observed network to a set of random networks. For 
example, the clustering coefficient of an observed net- 
work can be compared to those of randomly generated 
networks, to test if it is significantly smaller or larger 
than expected. A major concern when constructing ran- 
dom networks is what properties of the observed net- 
work should be retained (Croft, James & Krause, 2008). 
The answer to this question depends on the hypothesis 
being tested. For example, when testing the significance 
of the clustering coefficient, it is reasonable to retain the
original number of nodes and edges, density and possi- 
bly also the degree distribution, such that the observed 
network is compared to random networks with similar 
properties. 

Several software packages exist that permit the com- 
putation of many of the metrics from this section 
that can be used to make inferences about the network. 
Examples include UCINet (Borgatti, Everett & 
Freeman, 2002), Gephi (Bastian, Heymann & Jacomy, 
2009), igraph (Csardi & Nepusz, 2006) and Cytoscape 
(Shannon et al., 2003). 


(4) Formal grammars 


The structure of an acoustic sequence can be described 
using formal grammars. A grammar consists of a set 
of rewrite rules (or ‘productions’) that define the ways 
in which units can be ordered. Grammar rules consist 
of operations performed on 'terminals' (in our case, 
units), which are conventionally denoted with lower 
case letters, and non-terminals (symbols that must be 
replaced by terminals before the derivation is com- 
plete), conventionally denoted with upper case letters 
(note that this convention is inconsistent with the upper 
case convention used for acoustic unit labels). Grammars 
generate sequences iteratively, by applying rules 
repeatedly to a growing sequence. For example, the rule 
'U → aW' means that the non-terminal U can be rewritten 
with the symbols 'aW'. The terminal a is a unit of the 
kind we are familiar with, but W is a non-terminal, and may 
itself be rewritten by a different rule. For an example, 
see Fig. 11. 

Fig. 11. Grammar (rewrite rules) for approximating the sequence of acoustic units produced by Eastern Pacific blue whales 
Balaenoptera musculus. There are three acoustic units, a, b, and d (Oleson, Wiggins & Hildebrand, 2007), and the sequence 
begins with a start symbol S. Individual b or d calls may be produced, or song, which consists of repeated sequences of an 
a call followed by one or more b calls. The symbol | indicates a choice, and ε, the empty string, indicates that the rule is 
no longer used. A derivation is shown for the song abbab. Underlined variables indicate those to be replaced. Grammar 
produced with contributions from Ana Širović (Scripps Institution of Oceanography). 
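
The way rewrite rules build a sequence step by step can be sketched in a few lines of code. The rules below are a hypothetical right-linear grammar written to be consistent with the verbal description in the caption of Fig. 11 (single b or d calls, or song made of an a call followed by one or more b calls, repeated); they are an assumption for illustration and are not claimed to be the rules printed in the figure. The names RULES, derive and random_sequence are likewise our own.

```python
import random

# Hypothetical rules. Upper case = non-terminals, lower case = terminals
# (acoustic units), "" = the empty string.
RULES = {
    "S": ["b", "d", "aB"],  # a single b call | a single d call | start of song
    "B": ["bB", "bC"],      # one or more b calls after each a call
    "C": ["aB", ""],        # begin another phrase, or stop
}

def derive(choices, start="S"):
    """Rewrite the leftmost non-terminal using the given rule indices.

    Returns every intermediate form, so the derivation can be inspected.
    """
    form = start
    steps = [form]
    for idx in choices:
        pos = next(i for i, sym in enumerate(form) if sym in RULES)
        form = form[:pos] + RULES[form[pos]][idx] + form[pos + 1:]
        steps.append(form)
    return steps

def random_sequence(rng, start="S", max_steps=200):
    """Generate one sequence by choosing rules uniformly at random."""
    form = start
    for _ in range(max_steps):
        positions = [i for i, sym in enumerate(form) if sym in RULES]
        if not positions:
            return form  # only terminals remain: the derivation is complete
        pos = positions[0]
        form = form[:pos] + rng.choice(RULES[form[pos]]) + form[pos + 1:]
    return None  # derivation did not finish within max_steps

# Derivation of the song abbab under these hypothetical rules
print(" => ".join(derive([2, 0, 1, 0, 1, 1])))
print(random_sequence(random.Random(0)))
```

Each step replaces exactly one non-terminal, and the derivation ends when no upper-case symbols remain.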

Sequences that can be derived by a given grammar 
are called grammatical with respect to that grammar. 
The collection of all sequences that could possibly be 
generated by a grammar is called the language of the 
grammar. The validation of a grammar consists of veri- 
fying that the grammar's language matches exactly the 
set of sequences to be modelled. If a species produces 
sequences that cannot be generated by the grammar, 
the grammar is deemed ‘over-selective’. A grammar that 
is ‘over-generalising’ produces sequences not observed 
in the empirical data — although it is often unclear 
whether this represents a true failure of the grammar, 
or insufficient sampling of observed sequences. In the 
example given in Fig. 11, the grammar is capable of 
producing the sequence abbbbbbbbbbbbb; however, since 
blue whales have not been observed to produce similar 
sequences in decades of observation, we conclude 
that this grammar is over-generalising. It is important 
to note, however, that formal grammars are deterministic, 
in contrast to the probabilistic models discussed 
previously (Markov model, HMM). If one assigned probabilities 
to each of the rewriting rules, the particular 
sequence shown above may not have been observed simply 
because it is very unlikely. 
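
To make this last point concrete, the sketch below attaches probabilities to a hypothetical set of right-linear rules that matches the verbal description of the blue whale grammar (single b or d calls, or repeated phrases of an a call followed by one or more b calls). Both the rules and the probabilities are invented for illustration; they are not taken from Fig. 11 and have not been estimated from data.

```python
from functools import lru_cache

# Each rule: (unit emitted or "", next non-terminal or None, probability).
# Assumed values; the probabilities for each non-terminal sum to 1.
PRULES = {
    "S": [("b", None, 0.3), ("d", None, 0.3), ("a", "B", 0.4)],
    "B": [("b", "B", 0.3), ("b", "C", 0.7)],
    "C": [("a", "B", 0.6), ("", None, 0.4)],
}

def sequence_probability(seq, start="S"):
    """Probability that the weighted grammar generates exactly `seq`."""
    @lru_cache(maxsize=None)
    def prob(state, pos):
        total = 0.0
        for unit, nxt, p in PRULES[state]:
            if unit == "":
                # Empty rule (assumed to end the derivation): valid only
                # once the whole sequence has been emitted.
                if pos == len(seq):
                    total += p
            elif pos < len(seq) and seq[pos] == unit:
                if nxt is None:
                    # Terminating rule: must emit the final unit.
                    if pos + 1 == len(seq):
                        total += p
                else:
                    total += p * prob(nxt, pos + 1)
        return total
    return prob(start, 0)

for song in ["abbab", "a" + "b" * 13]:
    print(song, sequence_probability(song))
```

Under these assumed weights both sequences are grammatical, but the long run of b calls is several orders of magnitude less probable than abbab, so its absence from a finite sample would not by itself show that the grammar is over-generalising.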

Algorithms known as parsers can be constructed from 
grammars to determine whether a sequence belongs to 
the language for which the grammar has been inferred. 
Inferring a grammar from a collection of sequences 
is a difficult problem, which, as famously formulated 
by Gold (1967), is intractable for all but a number 
of restricted cases. Gold's formulation, however, does 
not appear to preclude the learning of grammar in 
real-world examples, and is of questionable direct rel- 
evance to the understanding or modelling of the psy- 
chology of sequence processing (Johnson, 2004). When 
restated in terms that arguably fit better the cogni- 
tive tasks faced by humans and other animals, gram- 
mar inference becomes possible (Clark, 2010; Clark, 
Eyraud & Habrard, 2010). Algorithms based on distributional 
learning, which seek probabilistically motivated 
phrase structure by recursively aligning and comparing 
input sequences, are becoming increasingly successful 
in sequence-processing tasks such as modelling language 
acquisition (Solan et al., 2005; Kolodny, Lotem & 
Edelman, in press). 

Fig. 12. The classes of formal grammars known as the 
Chomsky hierarchy (Chomsky, 1957). Each class is a generalisation 
of the class it encloses, and is more complex 
than the enclosed classes. Image publicly available 
under the Creative Commons Attribution-Share Alike 
3.0 Unported license 
(https://commons.wikimedia.org/wiki/File:Wiki_inf_chomskeho_hierarchia.jpg). 
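
For a regular language, the parser mentioned at the start of this discussion can be very simple. As a minimal sketch, assume the language described verbally for the blue whale example: a single b call, a single d call, or one or more phrases each consisting of an a call followed by one or more b calls. A regular expression then acts as a recogniser. The pattern and the function name are our own assumptions, not part of any published analysis.

```python
import re

# Assumed language: b | d | (a followed by one or more b), repeated
SONG_PATTERN = re.compile(r"b|d|(?:ab+)+")

def is_grammatical(sequence):
    """True if the whole sequence belongs to the assumed language."""
    return SONG_PATTERN.fullmatch(sequence) is not None

for seq in ["abbab", "d", "ba", "aab", "abbbbbbbbbbbbb"]:
    print(seq, is_grammatical(seq))
```

A recogniser of this kind answers only the membership question; it says nothing about how likely an accepted sequence is, nor about how the pattern could be inferred from data.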

A grammar can be classified according to its place 
in a hierarchy of classes of formal grammars known 
as the Chomsky hierarchy (Chomsky, 1957) and illus- 
trated in Fig. 12. These classes differ in the complexity 
of languages that can be modelled. The simplest class 
of grammars is that of regular grammars, which are 
capable of describing the generation of any finite set 
of sequences or repeating pattern, and are fundamentally 
similar to Markov models. Figure 11 is an example 
of a regular grammar. Kakishita et al. (2009) showed 
that Bengalese finch Lonchura striata var. domestica songs 
can be modelled by a restricted class of regular grammars, 
called ‘k-reversible regular grammars’, which is 
learnable from only positive samples, i.e. observed and 
hence permissible sequences, without information on 
those sequences that are not permissible in the grammar. 
Context-free grammars are more complex than 
regular grammars and are able to retain state information 
that enables one part of the sequence to affect 
another; this is usually demonstrated through the ability 
to create sequences of symbols where each unit 
is repeated the same number of times, A^n B^n, where n 
denotes n repetitions of the terminal unit, e.g. AAABBB 
(A^3 B^3). Such an ability requires keeping track of a 
state, e.g. 'how many times the unit A has been used', 
and a neurological implementation may be lacking in 
most species (Beckers et al., 2012). Context-sensitive languages 
allow context-dependent rewrite rules that have 
few restrictions, permitting further reaching dependencies 
such as in the set of sequences A^n B^n C^n, and require 
still more sophisticated neural implementations. The 
highest level in the Chomsky hierarchy, recursively enumerable 
grammars, is more complex still, and rarely 
has relevance to animal communication studies. 
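
The difference between these classes can be illustrated with a short sketch. A finite-state pattern such as A+B+ checks only that a run of A units is followed by a run of B units, whereas recognising sequences in which every unit is repeated the same number of times requires the run lengths to be remembered and compared. The helper names below are hypothetical.

```python
import re

def run_lengths(seq):
    """Collapse a sequence into (unit, run length) pairs."""
    runs = []
    for unit in seq:
        if runs and runs[-1][0] == unit:
            runs[-1][1] += 1
        else:
            runs.append([unit, 1])
    return [(unit, n) for unit, n in runs]

def matches_equal_runs(seq, units):
    """True if seq is one run of each unit, in order, all of equal length."""
    runs = run_lengths(seq)
    return ([unit for unit, _ in runs] == list(units)
            and len({n for _, n in runs}) == 1)

for seq in ["AAABBB", "AABBB"]:
    finite_state = re.fullmatch("A+B+", seq) is not None
    print(seq, finite_state, matches_equal_runs(seq, "AB"))

# The same check with three units corresponds to the context-sensitive case
print(matches_equal_runs("AABBCC", "ABC"))
```

The regular expression accepts both test sequences, while the counting check accepts only the one with matched run lengths; this is the extra 'state' referred to above.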

The level of a grammar within the Chomsky hierarchy 
can give an indication of the complexity of the com- 
munication system represented by that grammar. Most 
animal acoustic sequences are thought to be no more 
complex than regular grammars (Berwick et al., 2011), 
whereas complexity greater than the regular grammar 
is thought to be a unique feature of human language 
(Hauser et al., 2002). Therefore, indication that any 
animal communication could not be represented by a 
regular grammar would be considered an important 
discovery. For example, Gentner et al. (2006) proposed 
that European starlings Sturnus vulgaris can learn to 
recognise context-free (but non-regular) sequences, 
and reject sequences that do not correspond to the 
learned grammar. However, other authors have pointed 
out that the observed results could be explained by 
more simple mechanisms than context-free processing, 
such as primacy rules (Van Heijningen et al., 2009) in 
which simple analysis of short substrings is sufficient to 
distinguish between grammatical and non-grammatical 
sequences, or acoustic similarity matching (Beckers 
et al., 2012). Consequently, claims of greater than reg- 
ular grammar in non-human animals have not been 
widely accepted. The deterministic nature of regular 
grammars — or indeed any formal grammars — may 
explain why formal grammars are not sufficiently 
general to describe the sequences of many animal 
species, and formal grammars remain more popular in 
human linguistic studies than in animal communication 
research. 


(5) Temporal structure 


Information may exist in the relative or absolute timing 
of acoustic units in a sequence, rather than in the order 
of those units. In particular, timing and rhythm informa- 
tion may be of importance, and may be lost when acous- 
tic sequences are represented as a series of symbols. 
This section describes two different approaches to quan- 
tifying the temporal structure in acoustic sequences: 
traditional techniques examining inter-event interval 
and pulse statistics (e.g. Randall, 1989; Narins et al., 
1992), and recent multi-timescale rhythm analysis (Saar 
& Mitra, 2008). 

Analyses of temporal structure can be applied to any 
audio recording, regardless of whether that recording 
contains recognisable sequences, individual sounds, 
or multiple simultaneously vocalising individuals. 
Such analyses are most likely to be informative, how- 
ever, when recurring acoustic patterns are present, 
especially if those recurring patterns are rhythmic or 
produced at a predictable rate. Variations in inter- 
active sound-sequence production during chorusing 
and cross-individual synchronisation can be quantified 
through meter, or prosody analysis, and higher-order 
sequence structure can be identified through automated 
identification of repeating patterns. At the 
simplest level, it is possible to analyse the timing of 
sounds in a sequence, simply by recording when sound 
energy is above a fixed threshold. For instance, tem- 
poral patterns can be extracted automatically from 
simpler acoustic sequences by transforming recordings 
into sequences of numerical measures of the durations 
and silent intervals between sounds (Isaac & Marler, 
1963; Catchpole, 1976; Mercado, Herman & Pack, 
2003; Handel, Todd & Zoidis, 2009; Green et al., 2011), 
song bouts (Eens, Pinxten & Verheyen, 1989; Saar & 
Mitra, 2008), or of acoustic energy within successive 
intervals (Murray, Mercado & Roitblat, 1998; Mercado 
et al., 2010). Before the invention of the Kay sonograph, 
which led to the routine analysis of audio spectrograms, 
temporal dynamics of birdsong were often transcribed 
using musical notation (Saunders, 1951; Nowicki & 
Marler, 1988). 
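
The simplest analysis described above, recording when sound energy exceeds a fixed threshold, might be sketched as follows. The sketch assumes that the recording has already been reduced to one energy value per frame; the envelope values, the threshold and the frame length are invented for illustration, and the function names are our own.

```python
def sound_events(envelope, threshold, frame_s):
    """Return (onset, offset) times in seconds for each run of frames
    whose energy exceeds the threshold."""
    events = []
    start = None
    for i, energy in enumerate(envelope):
        if energy > threshold and start is None:
            start = i                      # sound begins
        elif energy <= threshold and start is not None:
            events.append((start * frame_s, i * frame_s))
            start = None                   # sound ends
    if start is not None:                  # sound still on at end of recording
        events.append((start * frame_s, len(envelope) * frame_s))
    return events

def durations_and_gaps(events):
    """Sound durations and the silent intervals between successive sounds."""
    durations = [off - on for on, off in events]
    gaps = [events[k + 1][0] - events[k][1] for k in range(len(events) - 1)]
    return durations, gaps

# Hypothetical energy envelope, one value per 10 ms frame
envelope = [0.0, 0.1, 0.9, 0.8, 0.7, 0.1, 0.0, 0.0, 0.9, 0.9, 0.1, 0.8, 0.0]
events = sound_events(envelope, threshold=0.5, frame_s=0.01)
print(events)
print(durations_and_gaps(events))
```

In practice the choice of threshold, and any smoothing applied to the envelope beforehand, can strongly affect which sounds are detected as separate events.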

Inter-pulse interval has been widely used to quantify 
temporal structure in animal acoustic sequences, for 
example in kangaroo rats Dipodomys spectabilis (Randall, 
1989), fruit flies Drosophila melanogaster (Bennet-Clark 
& Ewing, 1969), and rhesus monkeys Macaca mulatta 
(Hauser et al., 1998). Variations in pulse intervals can 
encode individual information such as identity and 
fitness (Bennet-Clark & Ewing, 1969; Randall, 1989), 
as well as species identity (Randall, 1997; Hauser et al., 
1998). In these examples, comparing the median 
inter-pulse interval between two sample populations is 
often sufficient to uncover significant differences. 
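
A comparison of median inter-pulse intervals can be sketched as below. The text above does not specify a statistical test; the permutation test used here is one simple option that we have substituted for illustration, and the pulse times are invented rather than taken from any of the cited studies.

```python
import random
from statistics import median

def inter_pulse_intervals(pulse_times):
    """Intervals between successive pulse onsets."""
    return [t1 - t0 for t0, t1 in zip(pulse_times, pulse_times[1:])]

def median_permutation_test(x, y, n_perm, rng):
    """Absolute difference in medians, with a permutation p-value."""
    observed = abs(median(x) - median(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign intervals to the two groups at random
        diff = abs(median(pooled[:len(x)]) - median(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical pulse onset times (seconds) for two individuals
individual_1 = [0.00, 0.11, 0.20, 0.31, 0.40, 0.52, 0.61, 0.70]
individual_2 = [0.00, 0.16, 0.33, 0.48, 0.65, 0.80, 0.97, 1.12]

ipi_1 = inter_pulse_intervals(individual_1)
ipi_2 = inter_pulse_intervals(individual_2)
print(median_permutation_test(ipi_1, ipi_2, 2000, random.Random(0)))
```

Intervals within a single recording are unlikely to be statistically independent, so a real analysis would normally treat individuals or recordings, rather than single intervals, as the units of replication.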

More recently developed techniques for analysis of 
temporal structure require more detailed processing. 
For example, periodic regularities and repetitions of 
patterns within recordings of musical performances can 
be automatically detected and characterised (Paulus, 
Müller & Klapuri, 2010; Weiss & Bello, 2011). The first 
step in modern approaches to analysing the temporal 
structure of sound sequences involves segmenting the 
recording. The duration and distribution of individual 
segments can be fixed (e.g. splitting a recording into 
100 ms chunks/frames) or variable (e.g. using multi- 
ple frame sizes in parallel or adjusting the frame size 
based on the rate and duration of acoustic events). The 
acoustic features of individual frames can then be anal- 
ysed using the same signal-processing methods that are 
applied when measuring the acoustic features of individ- 
ual sounds, thereby transforming the smaller waveform 
into a vector of elements that describe features of the 
segment. Sequences of such frame-describing vectors 
then would typically be used to form a matrix represent- 
ing the entire recording. In this matrix, the sequence of 
columns (or rows) corresponds to the temporal order 
of individual frames extracted from the recording. 
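
The segmentation and feature-extraction steps can be sketched as follows, assuming fixed-length, non-overlapping frames and two simple features per frame (root-mean-square amplitude and zero-crossing rate). The features, the synthetic test signal and the function names are illustrative assumptions; a real analysis would typically use richer spectral features.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a waveform into frames of frame_len samples, hop samples apart."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_features(frame):
    """Describe one frame by [RMS amplitude, zero-crossing rate]."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [rms, crossings / (len(frame) - 1)]

def feature_matrix(samples, frame_len, hop):
    """One row per frame, in temporal order; one column per feature."""
    return [frame_features(f) for f in frame_signal(samples, frame_len, hop)]

# Synthetic one-second recording at 8 kHz: a 1 kHz tone that is switched
# on and off every 100 ms
rate = 8000
samples = [math.sin(2 * math.pi * 1000 * t / rate) if (t // 800) % 2 == 0 else 0.0
           for t in range(rate)]

matrix = feature_matrix(samples, frame_len=800, hop=800)  # 100 ms frames
for row in matrix:
    print([round(value, 3) for value in row])
```

Here rows correspond to frames; the transposed layout, with frames as columns, is equally common.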

Regularities within the feature matrix generated from 
frame-describing vectors reflect temporal regularities 
within the original recording. Thus, the problem of 
describing and detecting temporal patterns within a 
recording is transformed into the more computation- 
ally tractable problem of detecting and identifying 
structure within a matrix of numbers (as opposed to 
a sequence of symbols). If each frame is described 
by a single number (e.g. mean amplitude), then the 
resulting sequence of numbers can be analysed using 
standard time-frequency analysis techniques to reveal 
rhythmic patterns (Saar & Mitra, 2008). Alternatively, 
each frame can be compared with every other frame to 
detect similarities using standard measures for quanti- 
fying the distance between vectors (Paulus et al., 2010). 
These distances are then often collected within a sec- 
ond matrix called a self-distance matrix. Temporal reg- 
ularities within the original feature matrix are visible as 
coherent patterns within the self-distance matrix (typically 
showing up as patterned blocks or diagonal stripes). Var- 
ious methods used for describing and classifying pat- 
terns within matrices (or images) can then be used to 
classify these two-dimensional patterns. 
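
A self-distance matrix of this kind can be computed directly from the frame-describing vectors. The sketch below assumes Euclidean distance and a toy feature matrix in which two kinds of frame alternate; averaging along the diagonals of the matrix is one simple way, chosen here for illustration, of summarising the diagonal stripes that repetition produces. All names and values are hypothetical.

```python
import math

def self_distance_matrix(features):
    """Euclidean distance between every pair of frame-describing vectors."""
    n = len(features)
    return [[math.dist(features[i], features[j]) for j in range(n)]
            for i in range(n)]

def mean_diagonal(matrix, lag):
    """Average distance between frames that are `lag` frames apart."""
    values = [matrix[i][i + lag] for i in range(len(matrix) - lag)]
    return sum(values) / len(values)

# Toy feature matrix: a loud, tonal frame alternating with a quiet frame
features = [[0.7, 0.25], [0.0, 0.0]] * 4

distances = self_distance_matrix(features)
for lag in range(1, 5):
    print(lag, round(mean_diagonal(distances, lag), 3))
```

Lags at which the mean distance falls to a minimum indicate the period of a repeating pattern, which appears in the matrix as a stripe parallel to the main diagonal.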

Different patterns in these matrices can be associated 
with variations in the novelty or homogeneity of the tem- 
poral regularities over time, as well as the number of 
repetitions of particular temporal patterns (Paulus et al., 
2010). Longitudinal analyses of time-series measures of 
temporal structure can also be used to describe the sta- 
bility or dynamics of rhythmic pattern production over 
time (Saar & Mitra, 2008). An alternative approach to 
identifying temporal structure within the feature matrix 
is to decompose it into simpler component matrices 
that capture the most recurrent features within the 
recording (Weiss & Bello, 2011). Similar approaches are 
common in modern analyses of high-density electroen- 
cephalograph (EEG) recordings (Makeig et al., 2004). 
Algorithms for analysing the temporal dynamics of brain 
waves may thus also be useful for analysing temporal 
structure within acoustic recordings. 


VI. FUTURE DIRECTIONS 


Many of the central research questions in animal com- 
munication focus on the meaning of signals and on the 
role of natural, sexual, and social selection in the evo- 
lution of communication systems. As shown in Fig. 6, 
information can exist in a sequence simultaneously via 
diversity, and order, as well as other less well-studied 
phenomena. Both natural and sexual selection may 
act on this information, either through conspecifics 
or heterospecifics (e.g. predators). This is especially 
true for animal acoustic sequences because the poten- 
tial complexity of a sequence may imply greater scope 
for both meaning and selective pressure. Many new 
questions, and several old and unanswered ones, can 
be addressed by the techniques that we have outlined 
herein. Some of the most promising avenues 
for future research are outlined below, with some outstanding 
questions in animal acoustic sequences that 
can potentially be addressed more effectively using the 
approaches proposed in this review. 


(1) As sequences are composed of units, how might 
information exist within units themselves? 


One promising direction lies in studying how animals 
use concatenated signals with multiple meanings. For 
example, Jansen, Cant & Manser (2012) provided evi- 
dence for temporal segregation of information within 
a syllable, where one segment of a banded mongoose 
Mungos mungo close call is individually distinct, while 
the other segment contains meaning about the caller's 
activity. Similar results have been demonstrated in the 
song of the white-crowned sparrow Zonotrichia leucophrys 
(Nelson & Poesel, 2007). Understanding how to divide 
acoustic units according to criteria other than silent 
gaps (Fig. 2) can change the research approach, as well 
as the results of a study. The presence of information 
in sub-divisions of traditional acoustic units is a subject 
underexplored in the field of animal communication, 
and an understanding of the production and perceptual 
constraints on unit definition (Fig. 4) is essential. 


(2) How does knowledge and analysis of sequences 
help us define and understand communication 
complexity? 


There is a long history of mathematical and physical 
sciences approaches to the question of complexity, 
which have typically defined complexity in terms of how 
difficult a system is to describe, how difficult a system 
is to create, or the extent of the system's disorder or 
organisation (Mitchell, 2009; Page, 2010). This is an 
area of heavy debate among proponents of different 
views of complexity, as well as a debate about whether 
a universal definition of complexity is even possible. In 
the life and social sciences, the particular arguments 
are often different from those of the mathematical and 
physical sciences, but a similar heavy debate about the 
nature of biological complexity exists (Bonner, 1988; 
McShea, 1991, 2009; Adami, 2002). 

Perceptual and developmental constraints may drive 
selection for communication complexity. However, 
complexity can exist at any one (or more) of the six lev- 
els of information encoding that we have detailed, often 
leading to definitions of communication complexity 
that are inconsistent among researchers. In light of mul- 
tiple levels of complexity, as well as multiple methods for 
separating units, we propose that no one definition of 
communication complexity can be universally suitable, 
and any definition has relevance only after choosing 
to which of the encoding paradigms described in 
Fig. 6 (or combination thereof) it applies. Complexity 
defined, say, for the Repetition paradigm (Fig. 6A) 
and quantified as pulse rate variation, is not easily 
compared with Diversity complexity (Fig. 6B), typically 
quantified as repertoire size. 

For example, is selection from increased social com- 
plexity associated with increased vocal complexity 
(Freeberg et al., 2012; Pollard & Blumstein, 2012), 
or do some other major selective factors — such as 
sexual selection or intensity of predation — drive the 
evolution of vocal complexity? In most of the studies 
to date on vocal complexity, complexity is defined in 
terms of repertoire size (Fig. 6B). Considerable evi- 
dence in diverse taxa indicates that increased social 
complexity is associated with increased repertoire size 
(reviewed in Freeberg et al., 2012). Different views of 
complexity in this literature are revealed by the fact 
that social complexity has been measured in terms 
of group size, group stability, or information-based 
metrics of group composition, and vocal complexity 
has been measured in terms of not just repertoire size, 
but also information-based metrics of acoustic variation 
in signals. In fact, the work of Pollard & Blumstein 
(2011) is highly informative to questions of complexity, 
in that different metrics of social complexity can drive 
different metrics of vocal complexity — these authors 
have found that group size is associated with greater 
individual distinctiveness (information) in the calls 
of species, but the diversity of social roles in groups 
is more heavily associated with vocal repertoire size. 
Some researchers have proposed the idea that commu- 
nicative complexity, again defined as repertoire size, 
has at least in some species been driven by the need 
to encode more information, or redundant informa- 
tion, in a complex social environment (Freeberg et al., 
2012). Alternatively, complexity metrics that measure 
Ordering (Fig. 6D), often based on non-zero orders of 
entropy (McCowan et al., 1999; Kershenbaum, 2013), 
may be more biologically relevant in species that use 
unit ordering to encode information. Understanding 
the variety of sequence types is essential to choosing 
the relevant acoustic unit definitions, and without this, 
testing competitive evolutionary hypotheses becomes 
problematic. 
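
The contrast between repertoire-based and order-based measures can be illustrated with a short sketch. It assumes one common way of describing successive orders of entropy: the logarithm of the repertoire size, the Shannon entropy of unit frequencies, and the entropy of the next unit given the current one. The cited studies may define or label these quantities differently, and the two example sequences are invented.

```python
import math
from collections import Counter

def repertoire_entropy(seq):
    """Entropy if all unit types were equally likely: log2(repertoire size)."""
    return math.log2(len(set(seq)))

def frequency_entropy(seq):
    """Shannon entropy of the observed unit frequencies."""
    counts = Counter(seq)
    h = 0.0
    for c in counts.values():
        h -= (c / len(seq)) * math.log2(c / len(seq))
    return h

def conditional_entropy(seq):
    """Entropy of the next unit given the current unit."""
    pairs = Counter(zip(seq, seq[1:]))
    contexts = Counter(seq[:-1])
    n_pairs = len(seq) - 1
    h = 0.0
    for (current, _), c in pairs.items():
        h -= (c / n_pairs) * math.log2(c / contexts[current])
    return h

ordered = "abababababababab"    # strict alternation
shuffled = "abbabaabbbaababa"   # same units in irregular order
for seq in (ordered, shuffled):
    print(repertoire_entropy(seq), frequency_entropy(seq),
          conditional_entropy(seq))
```

The two sequences have the same repertoire size and the same unit frequencies, so a measure based on diversity alone cannot separate them; only the conditional measure reflects the difference in ordering.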


(3) How do individual differences in acoustic 
sequences arise? 


If we can develop categories for unit types and sequence 
types that lead to productive vocalisation analysis and a 
deeper understanding of universal factors of encoded 
multi-layered messages, then individual differences in 
sequence production become interesting and puzzling. 
The proximal processes driving individual differences 
in communicative sequences are rarely investigated. 
Likewise, although there is a decades-rich history of 
song-learning studies in songbirds, the ontogenetic 
processes giving rise to communicative sequences per se 
have rarely been studied. Neural models (e.g. Jin, 2009) 
can provide probabilistic descriptions of sequence 
generation (e.g. Markov models, hidden Markov models), 
but the nature of the underlying stochasticity is 
unknown. When an appropriate choice of a model for 
sequence structure is made, quantitative comparisons 
can be carried out between the parameters of different 
individuals, for example with the California thrasher 
Toxostoma redivivum (Sasahara et al., 2012). However, 
model fitting is only valid if unit selection is biologically 
appropriate (Section III). Other, more abstract, 
questions can also be addressed. Individual humans use 
language with varying degrees of efficiency, creativity, 
and effectiveness. Shakespearean sequences are radi- 
cally unlike Haiku sequences, political speeches, or the 
babbling of infants, in part because their communica- 
tive purposes differ. While sexual selection and survival 
provide some purposive contexts through which we can 
approach meaning, additional operative contexts may 
suggest other purposes, and give us new frameworks 
through which to view vocal sequences (Waller, 2012). 
In many animals, song syntax may be related to sexual 
selection. Females of some species such as zebra finches 
Taeniopygia guttata not only prefer individuals with 
longer songs, but also songs comprising a greater vari- 
ety of syllables (Searcy & Andersson, 1986; Neubauer, 
1999; Holveck et al., 2008); whereas in other species, 
this preference is not observed (Byers & Kroodsma, 
2009). Variation in syntax may also reflect individual 
differences in intraspecific aggression, for instance in 
banded wrens Pheugopedius pleurostictus (Vehrencamp 
et al., 2007) and western populations of song sparrows 
Melospiza melodia (Burt, Campbell & Beecher, 2001). 
Individual syntax may also serve to distinguish neigh- 
bours from non-neighbours in song sparrows (Beecher 
et al., 2000) and skylarks Alauda arvensis (Briefer et al., 
2008). Male Cassin's vireos Vireo cassinii can usually be 
discriminated by the acoustic features of their song, 
but are discriminated even better by the sequences of 
phrases that they sing (Arriaga et al., 2013). 


(4) What is the role of sequence dialects in speciation? 


In a few species, geographic syntactic dialects (Nettle, 
1999) have been demonstrated, including primates, 
such as rhesus monkeys Macaca mulatta (Gouzoules, 
Gouzoules & Marler, 1984) and chimpanzees Pan 
troglodytes (Arcadi, 1996; Mitani, Hunley & Murdoch, 
1999; Crockford & Boesch, 2005), birds, such as Car- 
olina chickadees Poecile carolinensis (Freeberg, 2012), 
swamp sparrows Melospiza georgiana (Liu et al., 2008) 
and chaffinches Fringilla coelebs (Lachlan et al., 2013) 
and in rock hyraxes Procavia capensis (Kershenbaum 
et al., 2012). This broad taxonomic spread raises the 
question of whether sequence syntax has a role in 
speciation (Wiens, 1982; Nevo et al., 1987; Irwin, 2000; 
Slabbekoorn & Smith, 2002; Lachlan et al., 2013), 
with some support for such a role in chestnut-tailed 
antbirds Myrmeciza hemimelaena (Seddon & Tobias, 
2007) and winter wrens Troglodytes troglodytes (Toews & 
Irwin, 2008). It is tempting to speculate that acoustic 
sequences may have arisen from earlier selective forces 
acting on a communication system based on single 
units, with variation in the sequences of individuals 
providing differential adaptive benefit. The ability to 
communicate effectively with some but not others could 
lead to divergence of groups, and genetic pooling. Con- 
versely, differences in acoustic sequences could be 
adaptive to ecological variation. It is hard to distinguish 
retrospectively between sequence dialect shift leading 
to divergence of sub-groups and eventual speciation, or 
group separation leading to new communicative strate- 
gies that are epiphenomena of species formation. What 
are the best methods for investigating the relationship 
between communication and biological change? 

A third alternative is that sequence differences could 
arise by neutral processes analogous to drift. A complex 
interplay between production, perception, and encod- 
ing of information in sequence syntax, along with the 
large relative differences between different species in 
adaptive flexibility (Seyfarth & Cheney, 2010), could 
lead to adaptive pressures on communication structure. 
However, the definition of acoustic units is rarely con- 
sidered in this set of questions. In particular, perceptual 
binding (Fig. 4A) and the response of the focal species 
must be considered, as reproductive isolation cannot 
occur on the basis of differences that are not perceived 
by the receiver. As units may be divided at many lev- 
els, there may be multiple sequences that convey differ- 
ent information types. Thus, a deeper understanding of 
units and sequences will contribute productively to ques- 
tions regarding forces at work in speciation events. 


(5) Future directions: conclusions 


We conclude by noting that more detailed and rig- 
orous approaches to investigating animal acoustic 
sequences will allow us to investigate more complex 
systems that have not been formally studied. A number 
of directions lack even a basic framework such as we have 
proposed in this review. For example, there is much to 
be learned from the detailed study of the sequences 
created by multiple animals vocalising simultaneously, 
and from the application of sequence analysis to 
multimodal communication with a combination of 
acoustic, visual, and perhaps other modalities (e.g. 
Partan & Marler, 1999; Bradbury & Vehrencamp, 2011; 
Munoz & Blumstein, 2012). Eavesdropping, in which 
non-target receivers (such as predators) gain addi- 
tional information from listening to the interaction 
between individuals, has only just begun to be stud- 
ied in the context of sequence analysis. Finally, the 
study of non-stationary systems, where the statistical 
nature of the communicative sequences changes over 
long or short time scales (such as appears to occur 
in humpback whale songs) is ripe for exploration. 
For example, acoustic sequences may be constantly 
evolving sexual displays that are stereotyped within 
a population at any particular point in time (Payne 
& McVay, 1971; Payne, Tyack & Payne, 1983). The 
application of visual classification (Garland et al., 2011) 
and a statistical approach based on edit distance (e.g. 
Kershenbaum et al., 2012) appears to capture the 
sequential information present within humpback whale 
song (Garland et al., 2012, 2013). This work traced 
the evolution of song lineages, and the movement or 
horizontal cultural transmission of multiple different 
versions of the song that were concurrently present 
across an ocean basin over a decade (Garland et al., 
2013). These results are encouraging for the inves- 
tigation of complex non-stationary systems; however, 
further refinement of this approach is warranted. 
We encourage researchers in these fields to extend 
treatments such as ours to cover these more complex 
directions in animal communication research, thereby 
facilitating quantitative comparisons between fields. 
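
As an illustration of the edit-distance approach mentioned above, the sketch below computes the plain Levenshtein distance between two sequences of unit labels: the minimum number of insertions, deletions and substitutions needed to turn one into the other. The cited studies may use different operation weights or a normalisation; the equal weights, the function name and the example songs here are assumptions made for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance between two sequences of unit labels."""
    previous = list(range(len(t) + 1))   # distances from the empty prefix of s
    for i, a in enumerate(s, 1):
        current = [i]
        for j, b in enumerate(t, 1):
            current.append(min(previous[j] + 1,              # delete a
                               current[j - 1] + 1,           # insert b
                               previous[j - 1] + (a != b)))  # substitute
        previous = current
    return previous[-1]

# Hypothetical songs, each a list of unit labels
song_year_1 = ["a", "b", "b", "c", "d"]
song_year_2 = ["a", "b", "c", "c", "d", "e"]
print(edit_distance(song_year_1, song_year_2))
```

Distances computed in this way between all pairs of songs can then be passed to standard clustering methods to group similar song types.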


VII. CONCLUSIONS 


(1) The use of acoustic sequences by animals is 
widespread across a large number of taxa. As diverse 
as the sequences themselves is the range of analytical 
approaches used by researchers. We have proposed a 
framework for analysing and interpreting such acoustic 
sequences, based around three central ideas of under- 
standing the information content of sequences, defin- 
ing the acoustic units that comprise sequences, and 
proposing analytical algorithms for testing hypotheses 
on empirical sequence data. 

(2) We propose use of the term ‘meaning’ to refer to 
a feature of communication sequences that influences 
behavioural and evolutionary processes, and the term 
‘information’ to refer to the non-random statistical 
properties of sequences. 

(3) Information encoding in acoustic sequences can 
be classified into six non-mutually exclusive paradigms: 
Repetition, Diversity, Combination, Ordering, Overlapping, 
and Timing. 

(4) The constituent units of acoustic sequences can be 
classified according to production mechanisms, percep- 
tion mechanisms, or analytical properties. 

(5) Discrete acoustic units are often delineated by 
silent intervals. However, changes in the acoustic prop- 
erties of a continuous sound may also indicate a tran- 
sition between discrete units, multiple repeated sounds 
may act as a discrete unit, and more complex hierarchi- 
cal structure may also be present. 

(6) We have reviewed five approaches used for 
analysing the structure of animal acoustic sequences: 
Markov chains, hidden Markov models, network mod- 
els, formal grammars, and temporal models, discussing 
their use and relative merits. 

(7) Many important questions in the behavioural 
ecology of acoustic sequences remain to be answered, 
such as understanding the role of communication complexity, 
including multimodal sequences, the potential 
effect of communicative isolation on speciation, and the 
source of syntactic differences among individuals. 